Building on the exploratory data analysis findings of the first project, this project aims to produce regression and machine learning models. Several conclusions drawn from that analysis shaped the scope of this project:
1: The means of SVI scores across themes are similar and close in range.
2: Each theme addresses important vulnerabilities.
3: Little new correlation was found between variables and SVI themes; while this limited the earlier findings, it is still an interesting result in itself.
Thus, this project brings in two additional datasets to examine new correlations and to identify whether they could be effective predictors of SVI scores.
During the project’s initial phase, few novel correlations between demographic variables and SVI themes were identified at the census tract level. No demographic variables were left out of the SVI themed groupings, indicating that the CDC-compiled SVI is a holistic, well-constructed representation of vulnerability. In the second phase, the goal is to use the themed and overall SVI scores, in combination with additional datasets, to determine whether predictions can be made about census tract or county vulnerability scores. Specifically, correlations between COVID-19 data, median income, and the SVI dataset will be examined. The objective extends to developing predictive models designed to pinpoint vulnerable communities during external stressors (such as a disease outbreak). The temporal scale of the dataset covers SVI scores before and after the COVID-19 pandemic.
SMART Questions:
1: Can we explore the presence of correlations between COVID-19 mortality and elevated Social Vulnerability Index (SVI) scores across SVI themes?
2: Is it possible to assess the influence of COVID-19 on SVI scores (before and after the pandemic)?
3: Does median income impact the SVI scores in a significant way?
4: Can we develop predictive models to identify vulnerable communities following external stressors?
5: Which features are important/significant in predicting SVI?
The economic dataset adds another census-reported variable as a metric of economic status. Combined with other variables selected through feature selection, it will be used to build a regression model, and the selected features will then feed a classification tree algorithm that predicts whether a tract is Low risk, Medium-Low risk, Medium-High risk, or High risk. Breaking the outcome into four levels can help officials, planners, and first responders focus attention on the tracts classified as highest risk, and the tree's nodes give a quick view of the demographic breakdown.
The primary objective of utilizing the COVID dataset is to incorporate supplementary variables associated with COVID-19, encompassing metrics such as cases and deaths, into the Social Vulnerability Index (SVI) dataset. These additional variables are crucial indicators of the pandemic’s extent and impact. Through meticulous feature selection on the merged dataset, significant factors will be identified. Subsequently, these selected features will be employed to construct a robust logistic regression model capable of predicting the SVI score. This analysis holds the potential to assist policymakers, healthcare professionals, and emergency responders in efficiently allocating resources by precisely identifying areas at the highest risk of COVID-19 transmission and its associated challenges.
The SVI comprises 5 total calculations: 4 thematic scores and 1 overall summary formed by summing the themes.
It is constructed by selecting specific indicator variables within different themes, chosen to represent various aspects of vulnerability, which lets this project examine whether any theme leaves out variables that could be important. Census tracts are then ranked within each state, as well as against other states, producing tract rankings that range from 0 to 1, with higher values indicating greater vulnerability. The CDC states: “For each tract, we generated its percentile rank among all tracts for 1) the 16 individual variables, 2) the four themes, and 3) its overall position.”
These percentile ranks were then summed within each of the four themes, and the sums were ordered to determine theme-specific percentile rankings.
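The percentile-ranking step can be sketched in R. This is an illustrative sketch, not the CDC's code: the CDC documentation describes percentile rank as (rank − 1) / (n − 1), and the tie-handling rule and indicator values below are assumptions.

```r
# Illustrative percentile rank: (rank - 1) / (n - 1); tie handling assumed.
pct_rank <- function(x) (rank(x, ties.method = "min") - 1) / (length(x) - 1)

indicator <- c(12.1, 30.5, 7.4, 22.0, 18.3)  # hypothetical tract-level values
pct_rank(indicator)
# 0.25 1.00 0.00 0.75 0.50
```

Higher values indicate greater relative vulnerability on that indicator, matching the 0-to-1 ranking described above.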
The geographic scale of the data is limited to California census tracts, which allows a detailed analysis of over 9,000 census tracts, hopefully enabling more tailored actions and responses. California is prone to natural disasters such as earthquakes and wildfires, and has a very large population, making it an important case study.
The economic data used in this project is Median Income of Households in 2019 acquired from the US Census Bureau for the California Census Tract level.
The COVID-19 data used in this project is acquired from the Roche Data Science Coalition (RDSC) for all 50 states.
Information on cleaning the SVI dataset can be found in the Project 1 write-up.
econ <- read.csv("ACSDT5Y2020.B19013-Data.csv")
econ <- subset(econ, select = c(GEO_ID, NAME,B19013_001E))
econ <- econ[-c(1, 2), ]
# Rename columns for readability
names(econ) <- c("GEO_ID", "tract", "income")
# edit GEO_ID to isolate just the number after 1400000US06001400100
econ$GEO_ID <- sub(".*US0*(\\d+)", "\\1", econ$GEO_ID)
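As a quick sanity check of the pattern above, applied to the example GEO_ID from the comment:

```r
# The pattern strips everything through "US" plus any leading zeros,
# keeping the tract FIPS digits.
sub(".*US0*(\\d+)", "\\1", "1400000US06001400100")
# "6001400100"
```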
SVI_Data <- read.csv("SVI_2020_US.csv")
us_cases <- read.csv('USAFacts/confirmed-covid-19-cases-in-us-by-state-and-county.csv')
us_deaths <- read.csv('USAFacts/confirmed-covid-19-deaths-in-us-by-state-and-county.csv')
Data was cleaned and formatted prior to analysis.
#Import SVI
SVI_Data <- read.csv("SVI_2020_US.csv")
Clean_data <- subset(SVI_Data, select = c(ST,STATE,ST_ABBR,STCNTY,COUNTY,FIPS,LOCATION,AREA_SQMI,EPL_POV150, EPL_UNEMP, EPL_HBURD, EPL_NOHSDP, EPL_UNINSUR, SPL_THEME1, RPL_THEME1, EPL_AGE65, EPL_AGE17, EPL_DISABL, EPL_SNGPNT, EPL_LIMENG, SPL_THEME2, RPL_THEME2, EPL_MINRTY, SPL_THEME3, RPL_THEME3, E_MINRTY, EP_HISP, EP_ASIAN, EP_AIAN, EPL_MUNIT, EPL_MOBILE, EPL_CROWD, EPL_NOVEH, EPL_GROUPQ, SPL_THEME4, RPL_THEME4, SPL_THEMES, RPL_THEMES, E_AGE65, EP_POV150, EP_AGE65, EP_NOHSDP
) )
CA_SVI <- subset(Clean_data, ST_ABBR == "CA")
# Drop tracts with missing (-999) theme scores
CA_SVI <- subset(CA_SVI, RPL_THEMES != -999 & RPL_THEME1 != -999 &
                   RPL_THEME2 != -999 & RPL_THEME3 != -999 & RPL_THEME4 != -999)
#Join SVI to econ based on GEO_ID
svi_econ <- merge(econ, CA_SVI, by.x = "GEO_ID", by.y = "FIPS", all.x = TRUE, all.y = TRUE)
# count missing values (NA)
total_na_count <- sum(is.na(svi_econ))
#print(total_na_count)
#remove NA
svi_econ <- na.omit(svi_econ)
svi_econ <- svi_econ[svi_econ$income != "-", , drop = FALSE]
svi_econ$income <- as.numeric(as.character(svi_econ$income))
svi_econ <- na.omit(svi_econ)
Histogram of Median Income
ggplot(svi_econ, aes(x = income)) +
geom_histogram(binwidth = 1000, fill = "seagreen", color = "palegreen3", alpha = 0.7) +
labs(title = "Histogram of Median Income of Census Tracts in 2019", x = "Median Income", y = "Frequency")
From this plot, we can see that the income data is right-skewed: lower median incomes are more common. For the purposes of classifying data, this distribution does not necessarily need to be normally distributed, as long as there are enough observations to represent each target class.
Looking back at our RPL_THEMES, we can observe its distribution as well.
ggplot(svi_econ, aes(x = RPL_THEMES)) +
geom_histogram(binwidth = 0.01, fill = "tomato1", color = "tomato4", alpha = 0.7) +
labs(title = "Histogram of SVI Score of Census Tracts in 2020", x = "SVI Score", y = "Frequency")
It is seen that the distribution is skewed left.
Map
In the original project, variables were mapped by county. It is likewise possible to map the income distribution, here at the census tract level, to get a visual understanding of its distribution.
map1 <- ggplot(data = econmap) +
geom_sf(aes(fill = income)) +
labs(title = "Median Income by Census Tract: 2019",
fill = "Income in Dollars") +
scale_fill_viridis_c() +
theme_void() +
theme(text = element_text(family = "Avenir"))
map1
It is seen that the majority of census tracts fall in a lower range, with a clustering of higher incomes in the coastal region in the middle of the state.
Examining income’s effect on the SVI score can be done using linear regression to interpret these effects.
library(broom)
library(knitr)
# Scale or normalize the variables
svi_econ$income_scaled <- svi_econ$income / 100000 # Rescale income to units of $100,000
model_econ <- lm(RPL_THEMES ~ income_scaled, data = svi_econ)
# Tidy up the results using broom
tidy_results <- tidy(model_econ)
#summary(model_econ)
# Print the formatted table
kable(tidy_results, format = "markdown")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1.0740184 | 0.0043133 | 249.0021 | 0 |
| income_scaled | -0.5687045 | 0.0046053 | -123.4893 | 0 |
income_scaled (-0.568704): The estimated change in RPL_THEMES for a one-unit increase in income_scaled, where one unit corresponds to a $100,000 increase in median income. The negative coefficient suggests that higher median income is associated with a lower SVI score.
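Since income_scaled is income divided by 100,000, the coefficient can be rescaled to per-dollar terms (using the estimate from the regression table above):

```r
# Convert the $100,000-scale coefficient back to a per-dollar effect.
beta_scaled <- -0.5687045            # estimate from the regression table
beta_per_dollar <- beta_scaled / 100000
beta_per_dollar
# about -5.69e-06 SVI units per additional dollar of median income
```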
F-statistic (15250): A test of the overall significance of the model. It compares the fit of the intercept-only model with the fit of the given model. A higher F-statistic and a lower p-value (< 0.05) suggest that at least one variable is significantly related to the dependent variable.
P-value (< 2.2e-16): The p-value associated with the F-statistic is very close to zero, indicating that the overall model is statistically significant.
Conclusion: The results indicate that income (measured by the median income of census tracts) is likely a useful predictor of SVI risk scores, as it is statistically significant. This answers the SMART question of whether median income has a statistically significant impact on SVI scores.
To answer the SMART question regarding which features are significant in predicting and modeling SVI, feature selection algorithms were used. Numerous methodologies were used to select relevant and significant features for the linear regression and classification model. This part of the project will walk through this selection process.
Highly correlated variables:
# Convert selected columns to numeric
svi_select <- mutate_all(svi_econ, as.numeric)
# Drop columns that contain any NA values
svi_select <- svi_select %>%
  select(where(~ !any(is.na(.))))
# Rename multiple columns simultaneously
colnames(svi_select)[colnames(svi_select) == 'RPL_THEMES'] <- 'SVI'
colnames(svi_select)[colnames(svi_select) == 'EPL_POV150'] <- 'Poverty'
colnames(svi_select)[colnames(svi_select) == 'EPL_HBURD'] <- 'Housing_Cost_Burdened'
colnames(svi_select)[colnames(svi_select) == 'EPL_NOHSDP'] <- 'No_Diploma'
# Create the correlation matrix
cor_matrix <- cor(svi_select, use = "complete.obs")
# Variable of interest
target_variable <- "SVI"
# Extract the correlations with the target variable
cor_with_target <- cor_matrix[target_variable, ]
# Select variables with high correlation
high_correlation_vars <- names(cor_with_target[abs(cor_with_target) > 0.65])
# Print the variables with high correlation
print(high_correlation_vars)
## [1] "income" NA "Poverty"
## [4] "Housing_Cost_Burdened" "No_Diploma" "EPL_UNINSUR"
## [7] "SPL_THEME1" "RPL_THEME1" "SPL_THEME2"
## [10] "RPL_THEME2" "EP_HISP" "EPL_CROWD"
## [13] "SPL_THEME4" "RPL_THEME4" "SPL_THEMES"
## [16] "SVI" "EP_POV150" "EP_NOHSDP"
## [19] "income_scaled"
# Select the relevant columns for correlation
correlation_matrix1 <- svi_select %>%
select(SVI, Poverty, Housing_Cost_Burdened, No_Diploma, income_scaled)
# Calculate the correlation matrix
correlation_matrix <- cor(correlation_matrix1)
loadPkg("corrplot")
corrplot(correlation_matrix, method = "square", type = "lower", col = colorRampPalette((c("#2166AC","#FDDBC7","#B2182B")))(100))
This assessment utilized a correlation matrix to identify which variables have a strong relationship with the target variable SVI. The threshold for a significant relationship was set to 0.65.
From the correlation assessment, only four variables (income_scaled, Poverty (EPL_POV150), Housing_Cost_Burdened (EPL_HBURD), and No_Diploma (EPL_NOHSDP)) had a correlation of 0.65 or higher with the SVI (RPL_THEMES) variable. This suggests that these variables are likely predictors of the outcome SVI score, but it is important to assess multicollinearity.
Residual plots:
The residual plots appear approximately normally distributed. It is also important to assess the variables with a VIF check to address multicollinearity.
predict_svi <- svi_econ %>%
select(RPL_THEMES, income_scaled, EPL_POV150, EPL_HBURD, EPL_NOHSDP)
colnames(predict_svi)[colnames(predict_svi) == 'RPL_THEMES'] <- 'SVI'
colnames(predict_svi)[colnames(predict_svi) == 'EPL_POV150'] <- 'Poverty'
colnames(predict_svi)[colnames(predict_svi) == 'EPL_HBURD'] <- 'Housing_Cost_Burdened'
colnames(predict_svi)[colnames(predict_svi) == 'EPL_NOHSDP'] <- 'No_Diploma'
model <- lm(SVI ~ ., data = predict_svi)
# Check the distribution of residuals
residuals <- residuals(model)
# Residual Plot
par(mfrow = c(2, 2))
#plot(model)
# Q-Q Plot
qqnorm(residuals)
qqline(residuals)
# Kernel Density Plot
plot(density(residuals))
VIF Assessment
model_test <- glm(SVI ~ ., data = predict_svi)
summary(model_test)
##
## Call:
## glm(formula = SVI ~ ., data = predict_svi)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.58346 -0.06916 0.00315 0.07167 0.42826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.154554 0.010208 15.14 <2e-16 ***
## income_scaled -0.070985 0.006151 -11.54 <2e-16 ***
## Poverty 0.260760 0.007749 33.65 <2e-16 ***
## Housing_Cost_Burdened 0.221818 0.007789 28.48 <2e-16 ***
## No_Diploma 0.407627 0.005192 78.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.01240325)
##
## Null deviance: 723.69 on 8955 degrees of freedom
## Residual deviance: 111.02 on 8951 degrees of freedom
## AIC: -13892
##
## Number of Fisher Scoring iterations: 2
data.frame(vif(model_test))
## vif.model_test.
## income_scaled 4.300141
## Poverty 3.508515
## Housing_Cost_Burdened 3.647747
## No_Diploma 1.981004
A high VIF value can indicate multicollinearity. VIF values above 5 are considered moderately high, and values above 10 are generally recommended against. In this analysis, all VIF values are below 5, so all 4 variables were retained from this process.
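As a hedged illustration of what the VIF measures (simulated data, not the SVI variables): each predictor is regressed on the remaining predictors, and VIF = 1 / (1 − R²) of that auxiliary regression.

```r
# Manual VIF on simulated data: x2 is built to correlate with x1,
# so its VIF should exceed that of the independent x3.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

vif_manual <- function(j, X) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
}
sapply(seq_len(ncol(X)), vif_manual, X = X)
```

The correlated pair (x1, x2) yields inflated VIFs, while x3 stays near 1, mirroring how the SVI variables above were screened.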
Stepwise feature selection was run on the dataset in both directions.
Initial Model
# Create the initial model
initial_model <- lm(RPL_THEMES ~ income_scaled + EPL_POV150 +EPL_UNINSUR +EPL_AGE65 +EPL_AGE17+ EPL_DISABL+ EPL_SNGPNT +EPL_LIMENG + EPL_MINRTY +EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = svi_econ)
# Perform stepwise selection (both directions)
stepwise_model <- step(initial_model, direction = "both")
## Start: AIC=-43749.07
## RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_UNINSUR + EPL_AGE65 +
## EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + EPL_MINRTY +
## EPL_UNEMP + EPL_HBURD + EPL_NOHSDP
##
## Df Sum of Sq RSS AIC
## <none> 67.511 -43749
## - income_scaled 1 0.2624 67.773 -43716
## - EPL_AGE17 1 0.2826 67.793 -43714
## - EPL_MINRTY 1 1.7925 69.303 -43516
## - EPL_AGE65 1 3.3452 70.856 -43318
## - EPL_UNINSUR 1 4.7516 72.262 -43142
## - EPL_UNEMP 1 4.8179 72.329 -43134
## - EPL_LIMENG 1 6.5839 74.095 -42918
## - EPL_HBURD 1 7.2463 74.757 -42838
## - EPL_POV150 1 7.2804 74.791 -42834
## - EPL_DISABL 1 7.3091 74.820 -42830
## - EPL_NOHSDP 1 7.6045 75.115 -42795
## - EPL_SNGPNT 1 8.5891 76.100 -42679
Stepwise Results:
# Display the summary of the selected model
summary(stepwise_model)
##
## Call:
## lm(formula = RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_UNINSUR +
## EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG +
## EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = svi_econ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42698 -0.05646 -0.00013 0.05753 0.29479
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.230164 0.010370 -22.195 < 2e-16 ***
## income_scaled -0.030575 0.005186 -5.896 3.87e-09 ***
## EPL_POV150 0.194878 0.006275 31.055 < 2e-16 ***
## EPL_UNINSUR 0.123601 0.004927 25.088 < 2e-16 ***
## EPL_AGE65 0.100585 0.004778 21.051 < 2e-16 ***
## EPL_AGE17 0.025390 0.004150 6.119 9.83e-10 ***
## EPL_DISABL 0.150926 0.004850 31.116 < 2e-16 ***
## EPL_SNGPNT 0.140333 0.004160 33.731 < 2e-16 ***
## EPL_LIMENG 0.188629 0.006387 29.532 < 2e-16 ***
## EPL_MINRTY 0.131280 0.008519 15.409 < 2e-16 ***
## EPL_UNEMP 0.096957 0.003838 25.263 < 2e-16 ***
## EPL_HBURD 0.195848 0.006321 30.982 < 2e-16 ***
## EPL_NOHSDP 0.188922 0.005952 31.739 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08688 on 8943 degrees of freedom
## Multiple R-squared: 0.9067, Adjusted R-squared: 0.9066
## F-statistic: 7244 on 12 and 8943 DF, p-value: < 2.2e-16
Results: All of the coefficients are determined to be significant. The output suggests that the best model, according to AIC, includes all the features. Removing any single feature did not lead to a significant improvement in AIC. The model seems to be highly significant, and the included variables all have statistically significant coefficients. The model also explains a substantial portion of the variability in the dependent variable (RPL_THEMES), as indicated by the high R-squared value of 0.9067.
[Lasso analysis](https://www.statisticshowto.com/lasso-regression/) is used for variable selection; it adds a penalty term to the ordinary least-squares objective. The penalty shrinks the coefficients of less important features toward, or exactly to, zero, filtering out noise and keeping the focus on the features that most affect the target variable.
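In symbols, LASSO solves the penalized least-squares problem (standard formulation; the tuning parameter λ is chosen here by cross-validation):

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Larger λ values shrink more coefficients exactly to zero, which is what makes LASSO usable as a feature-selection tool.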
selected_columns <- c("RPL_THEMES", "income_scaled", "EPL_POV150", "EPL_UNINSUR",
"EPL_AGE65", "EPL_AGE17", "EPL_DISABL",
"EPL_SNGPNT", "EPL_LIMENG", "EPL_MINRTY",
"EPL_UNEMP", "EPL_HBURD", "EPL_NOHSDP")
# Filter the data frame
trim_svi <- svi_econ %>%
select(all_of(selected_columns))
# Prepare data
x <- model.matrix(RPL_THEMES ~ . - 1, data = trim_svi)
y <- svi_econ$RPL_THEMES
# Fit LASSO regression model
fit_lasso <- cv.glmnet(x, y, alpha = 1)
# Display coefficients of the LASSO model
coef_lasso <- coef(fit_lasso, s = fit_lasso$lambda.min)
print(coef_lasso)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.21752984
## income_scaled -0.03103186
## EPL_POV150 0.19490422
## EPL_UNINSUR 0.12096308
## EPL_AGE65 0.09326387
## EPL_AGE17 0.02160010
## EPL_DISABL 0.15030534
## EPL_SNGPNT 0.13961008
## EPL_LIMENG 0.18789132
## EPL_MINRTY 0.12494756
## EPL_UNEMP 0.09555821
## EPL_HBURD 0.19381535
## EPL_NOHSDP 0.19188624
# Fit Ridge regression model
#fit_ridge <- cv.glmnet(x, y, alpha = 0)
# Display coefficients of the Ridge model
#coef_ridge <- coef(fit_ridge, s = fit_ridge$lambda.min)
#print(coef_ridge)
#optimal_lambda <- fit_lasso$lambda.min
#coefficients_at_optimal_lambda <- coef(fit_lasso, s = optimal_lambda)
#print(coefficients_at_optimal_lambda)
Results: All of the coefficients turn out to be non-zero, though some are smaller than others. Given the nature of this analysis, coefficients are deemed meaningful if they are non-zero. They can be interpreted as follows:
income_scaled: The coefficient is -0.031, suggesting that a one-unit increase in income_scaled (an increase of $100,000 in median income) is associated with a decrease of approximately 0.031 units in the response variable, holding other variables constant.
EPL_POV150: The coefficient is 0.1949, suggesting a one-unit increase in EPL_POV150 is associated with an increase of approximately 0.1949 units in the response variable, holding other variables constant.
Of all the variables, EPL_AGE65 (0.0933), EPL_AGE17 (0.0216), and EPL_UNEMP (0.0956) have the smallest impact on overall SVI, but remain meaningful in the model.
Therefore, it appears that LASSO and Stepwise both selected the full model as the best model. In the next steps, a comparison of the simpler linear regression model and the full regression model will be carried out.
This answers the SMART question regarding which variables are significant in modeling and predicting SVI.
We begin the classification process by examining the relationships of these variables in a linear regression to identify which of the two models identified by feature selection is best to use. This is done by examining model performance and characteristics.
The simple model represents the linear regression modeled by: RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_HBURD + EPL_NOHSDP, as identified by the correlation matrix.
simple_model <- lm(RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_HBURD + EPL_NOHSDP, data = svi_econ)
summary(simple_model)
##
## Call:
## lm(formula = RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_HBURD +
## EPL_NOHSDP, data = svi_econ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58346 -0.06916 0.00315 0.07167 0.42826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.154554 0.010208 15.14 <2e-16 ***
## income_scaled -0.070985 0.006151 -11.54 <2e-16 ***
## EPL_POV150 0.260760 0.007749 33.65 <2e-16 ***
## EPL_HBURD 0.221818 0.007789 28.48 <2e-16 ***
## EPL_NOHSDP 0.407627 0.005192 78.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1114 on 8951 degrees of freedom
## Multiple R-squared: 0.8466, Adjusted R-squared: 0.8465
## F-statistic: 1.235e+04 on 4 and 8951 DF, p-value: < 2.2e-16
full_model <- lm(RPL_THEMES ~ income_scaled + EPL_POV150 +EPL_UNINSUR +EPL_AGE65 +EPL_AGE17+ EPL_DISABL+ EPL_SNGPNT +EPL_LIMENG + EPL_MINRTY +EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = svi_econ)
In the simple model, all of the variables are significant with low p-values. The R-squared of 0.8466 suggests a moderately strong fit.
The full model represents the model utilizing all of the variables.
summary(full_model)
##
## Call:
## lm(formula = RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_UNINSUR +
## EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG +
## EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = svi_econ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42698 -0.05646 -0.00013 0.05753 0.29479
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.230164 0.010370 -22.195 < 2e-16 ***
## income_scaled -0.030575 0.005186 -5.896 3.87e-09 ***
## EPL_POV150 0.194878 0.006275 31.055 < 2e-16 ***
## EPL_UNINSUR 0.123601 0.004927 25.088 < 2e-16 ***
## EPL_AGE65 0.100585 0.004778 21.051 < 2e-16 ***
## EPL_AGE17 0.025390 0.004150 6.119 9.83e-10 ***
## EPL_DISABL 0.150926 0.004850 31.116 < 2e-16 ***
## EPL_SNGPNT 0.140333 0.004160 33.731 < 2e-16 ***
## EPL_LIMENG 0.188629 0.006387 29.532 < 2e-16 ***
## EPL_MINRTY 0.131280 0.008519 15.409 < 2e-16 ***
## EPL_UNEMP 0.096957 0.003838 25.263 < 2e-16 ***
## EPL_HBURD 0.195848 0.006321 30.982 < 2e-16 ***
## EPL_NOHSDP 0.188922 0.005952 31.739 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08688 on 8943 degrees of freedom
## Multiple R-squared: 0.9067, Adjusted R-squared: 0.9066
## F-statistic: 7244 on 12 and 8943 DF, p-value: < 2.2e-16
In this model, all of the variables are significant with low p-values. The R-squared of 0.9067 indicates a better fit than the simple model.
# Assessment for initial_model
#summary(simple_model)
#plot(simple_model)
# Assessment for initial_model2
#summary(full_model)
#plot(full_model)
# Calculate AIC for initial_model
aic_simple_model <- AIC(simple_model)
# Calculate AIC for initial_model2
aic_full_model <- AIC(full_model)
# Compare AIC values
#cat("AIC for simple_model:", aic_simple_model, "\n")
#cat("AIC for full_model:", aic_full_model, "\n")
Assessing the AIC values of both models, the AIC for the simple model is -13891.99, while the AIC for the full model is -18331.04. The full model, with the lower AIC, is therefore the better model.
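As a hedged sketch of what the AIC comparison is doing (on simulated data, not the SVI models): AIC = 2k − 2·log-likelihood, where k counts the estimated parameters including the residual variance, so a lower AIC reflects a better penalized fit.

```r
# Verify the AIC formula against R's AIC() on a toy regression.
set.seed(7)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
m <- lm(y ~ x)
k <- length(coef(m)) + 1            # +1 for the residual variance
aic_manual <- 2 * k - 2 * as.numeric(logLik(m))
all.equal(aic_manual, AIC(m))
# TRUE
```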
# Calculate Adjusted R-squared for initial_model
adj_rsq_simple_model <- summary(simple_model)$adj.r.squared
# Calculate Adjusted R-squared for initial_model2
adj_rsq_full_model2 <- summary(full_model)$adj.r.squared
# Compare Adjusted R-squared values
cat("Adjusted R-squared for simple:", adj_rsq_simple_model, "\n")
cat("Adjusted R-squared for full:", adj_rsq_full_model2, "\n")
Adjusted R-squared for simple model: 0.8465219
Adjusted R-squared for full model: 0.9065885
The full model is more effective at explaining the variation in the SVI scores, and thus may have a higher predictive capability.
# Calculate VIF for initial_model
vif_simple_model <- car::vif(simple_model)
# Calculate VIF for initial_model2
vif_full_model <- car::vif(full_model)
# Compare VIF values
cat("VIF for simple_model:", vif_simple_model, "\n")
cat("VIF for full_model:", vif_full_model, "\n")
Simple Model VIF: 4.30, 3.51, 3.64, 1.98
Full Model VIF: 5.02, 3.78, 2.14, 2.23, 1.80, 1.77, 1.71, 2.73, 3.34, 1.25, 3.95, 4.23
Both simple and full models have VIF values under 10, and only one value (income) in the full model is slightly above 5. This is still within the range of acceptable values, therefore it is appropriate to use the full model.
As a result of the feature selection process, the full model was determined to be the appropriate and best-fitting model. Therefore, the variables from the full model will be used to analyze whether it is possible to classify census tracts into levels of SVI risk.
From the CDC website, the recommended levels of risk to model are:
“In the CDC/ATSDR SVI Interactive Map, we classify data using quartiles (0 to .2500, .2501 to .5000, .5001 to .7500, .7501 to 1.0) and indicate that the classification goes from least vulnerable to most vulnerable.”
The initial data processing required to classify the census tracts is to assign each tract a level of low, medium-low, medium-high, or high based on the quartile breaks. Below is a plot of the census tracts broken into quartiles.
# Create quartiles and labels
trim_svi$risk <- cut(trim_svi$RPL_THEMES, breaks = c(0, 0.25, 0.5, 0.75, 1.0), labels = c("low", "lowmed", "medhigh", "high"))
# Plot the data with quartile labels
library(ggplot2)
ggplot(trim_svi, aes(x = risk, fill = risk)) +
geom_bar() +
labs(title = "Bar Plot of SVI Quartiles", x = "SVI Risk", y = "Count") +
scale_fill_manual(values = c("low" = "#1a9641", "lowmed" = "#a6d96a", "medhigh" = "#fdae61", "high" = "#d7191c"))+
scale_x_discrete(labels = c("low" = "Low", "lowmed" = "Low-Medium", "medhigh" = "Medium-High", "high" = "High"))+
theme_minimal() +
theme(legend.position = "none", text = element_text(family = "Avenir"))
From this plot, it is seen that California has more medium-high and high risk census tracts than low and medium-low risk tracts. This may cause issues in classifying the data because of the imbalance across target classes, with a bias toward high-risk tracts.
Classification Trees
Classification trees with a maximum depth of 4 were used to classify the census tracts into the quartile rankings. While this may not be the best-fitting tree size, anything beyond 4 levels may become difficult to interpret, and the aim of the project is to create an easily interpretable figure for first responders and government officials. Interpretability is a strength of decision trees, though they may not be the most accurate or comprehensive algorithm.
set.seed(1)
svi_econ_tree <- rpart(risk ~ income_scaled + EPL_POV150 + EPL_UNINSUR + EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP , data=trim_svi, method="class", control = list(maxdepth = 4) )
#summary(svi_econ_tree)
#plot(svi_econ_tree, uniform=TRUE, main="Classification Tree for svi_econ_tree")
#text(svi_econ_tree, use.n=TRUE, all=TRUE, cex=.8)
This figure depicts the decision tree used to classify census tracts into the rankings. From this tree, it is possible to see which variables determine the classification of a tract and at what values. These variables are: EPL_POV150 (population below 150% of the poverty line), EPL_NOHSDP (no high school diploma), income_scaled (median income), and EPL_HBURD (housing cost burdened).
fancyRpartPlot(svi_econ_tree)
Below is a plot of the cross-validation relative error of the tree. Relative error measures the difference between predicted and actual values relative to the actual values, and is used to assess the tree's predictive performance. The x-axis labeled “cp” refers to the complexity parameter: a larger “cp” forces more aggressive pruning, resulting in a smaller, simpler tree, at a possible cost in predictive accuracy. The second x-axis is the size of the tree, i.e., how many splits there are.
This graph is important to analyze because a decision tree that grows too large can overfit, capturing noise instead of real relationships. The goal is to find a “cp” that yields a well-balanced tree with good predictive performance on both the training and validation data. This tree's relative error ranges from about 0.4 to 1, implying even the largest tree still has a relative error of roughly 40%.
#printcp(svi_econ_tree)
plotcp(svi_econ_tree)
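A common follow-up to this plot, shown here as a hedged sketch on rpart's built-in kyphosis data rather than the SVI tree, is to prune at the cp value that minimizes the cross-validated error:

```r
library(rpart)

# Fit a tree, find the cp with the lowest cross-validated error, and prune.
set.seed(1)
fit     <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

The same two lines applied to the SVI tree's cptable would yield the pruned version of the classification tree above.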
Confusion Matrix: Below is a confusion matrix using the above classification tree. From initial observations, it is clear that this model is not accurately classifying the census tracts into quartile risk classes.
# Generate predictions on the training data
train_predictions <- predict(svi_econ_tree, trim_svi, type = "class")
# Create a confusion matrix
conf_matrix_train <- table(train_predictions, trim_svi$risk)
xkabledply(conf_matrix_train, "confusion matrix")
| Predicted / Actual | low | lowmed | medhigh | high |
|---|---|---|---|---|
| low | 1004 | 377 | 38 | 3 |
| lowmed | 403 | 1104 | 503 | 23 |
| medhigh | 46 | 420 | 1153 | 393 |
| high | 2 | 32 | 537 | 2918 |
From the confusion matrix, it is possible to gain information about the accuracies of this classification model. For this analysis, the algorithm is classifying a categorical variable into 4 levels. Therefore, the table below depicts the overall accuracy, and then the precision and recall by class.
# Convert to confusion matrix object
conf_matrix <- as.table(conf_matrix_train)
# Print the confusion matrix
# Calculate metrics
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
precision <- diag(conf_matrix) / rowSums(conf_matrix)
recall <- diag(conf_matrix) / colSums(conf_matrix)
f1_score <- 2 * (precision * recall) / (precision + recall)
# Print metrics
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.6899285
cat("Precision (by class):", precision, "\n")
## Precision (by class): 0.7060478 0.5430398 0.5730616 0.8363428
cat("Recall (by class):", recall, "\n")
## Recall (by class): 0.6900344 0.571133 0.5168086 0.8744381
#cat("F1 Score (by class):", f1_score, "\n")
As seen in the results, the overall accuracy is fairly low, at 68.9%. The decision tree is better at classifying high-risk tracts, likely because the imbalanced data contain more high-risk census tracts. Additionally, the model may be less accurate at separating adjacent classes (low vs. low-medium, medium-high vs. high), where the differences in scores and variable values are small.
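One possible mitigation for the class imbalance, sketched here on simulated two-class data rather than the SVI tracts, is to pass equal class priors to rpart so that minority classes are not under-weighted in the split criterion:

```r
library(rpart)

# Imbalanced toy data: 180 "common" vs 20 "rare" observations.
set.seed(1)
df <- data.frame(x = c(rnorm(180, 0), rnorm(20, 2)),
                 y = factor(rep(c("common", "rare"), c(180, 20))))

# Equal priors make each class contribute equally, regardless of frequency.
fit_equal <- rpart(y ~ x, data = df, method = "class",
                   parms = list(prior = c(0.5, 0.5)))
```

For the four-level SVI risk variable, the analogous call would supply a prior vector of four equal weights; whether that improves the adjacent-class confusion seen above would need to be tested.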
Below is a table of the evaluation statistics for the model.
loadPkg("rpart")
loadPkg("caret")
# create an empty dataframe to store the results from confusion matrices
confusionMatrixResultDf = data.frame(
  Depth = numeric(0), Accuracy = numeric(0), Sensitivity = numeric(0),
  Specificity = numeric(0), Pos.Pred.Value = numeric(0),
  Neg.Pred.Value = numeric(0), Precision = numeric(0), Recall = numeric(0),
  F1 = numeric(0), Prevalence = numeric(0), Detection.Rate = numeric(0),
  Detection.Prevalence = numeric(0), Balanced.Accuracy = numeric(0),
  row.names = NULL
)
for (deep in 2:6) {
kfit <- rpart(risk ~ income_scaled + EPL_POV150 + EPL_UNINSUR + EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP , data=trim_svi, method="class", control = list(maxdepth = deep) )
#
cm = confusionMatrix( predict(kfit, type = "class"), reference = trim_svi[, "risk"] ) # from caret library
#
cmaccu = cm$overall['Accuracy']
# print( paste("Total Accuracy = ", cmaccu ) )
#
cmt = data.frame(Depth=deep, Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics
cmt = cbind( cmt, data.frame( t(cm$byClass) ) ) # the dataframe of the transpose, with k valued added in front
confusionMatrixResultDf = rbind(confusionMatrixResultDf, cmt)
# print("Other metrics : ")
}
unloadPkg("caret")
xkabledply(confusionMatrixResultDf, title="SVI econ Classification Trees summary with varying MaxDepth")
| | Depth | Accuracy | Class: low | Class: lowmed | Class: medhigh | Class: high |
|---|---|---|---|---|---|---|
| Sensitivity | 2 | 0.6175 | 0.8570 | 0.0000 | 0.7571 | 0.7773 |
| Specificity | 2 | 0.6175 | 0.8400 | 1.0000 | 0.7206 | 0.9382 |
| Pos Pred Value | 2 | 0.6175 | 0.5096 | NaN | 0.4734 | 0.8820 |
| Neg Pred Value | 2 | 0.6175 | 0.9680 | 0.7842 | 0.8994 | 0.8765 |
| Precision | 2 | 0.6175 | 0.5096 | NA | 0.4734 | 0.8820 |
| Recall | 2 | 0.6175 | 0.8570 | 0.0000 | 0.7571 | 0.7773 |
| F1 | 2 | 0.6175 | 0.6392 | NA | 0.5825 | 0.8264 |
| Prevalence | 2 | 0.6175 | 0.1625 | 0.2158 | 0.2491 | 0.3726 |
| Detection Rate | 2 | 0.6175 | 0.1392 | 0.0000 | 0.1886 | 0.2896 |
| Detection Prevalence | 2 | 0.6175 | 0.2732 | 0.0000 | 0.3984 | 0.3284 |
| Balanced Accuracy | 2 | 0.6175 | 0.8485 | 0.5000 | 0.7388 | 0.8578 |
| Sensitivity | 3 | 0.6663 | 0.4584 | 0.6958 | 0.6100 | 0.7773 |
| Specificity | 3 | 0.6663 | 0.9809 | 0.8149 | 0.8217 | 0.9382 |
| Pos Pred Value | 3 | 0.6663 | 0.8235 | 0.5085 | 0.5316 | 0.8820 |
| Neg Pred Value | 3 | 0.6663 | 0.9033 | 0.9068 | 0.8640 | 0.8765 |
| Precision | 3 | 0.6663 | 0.8235 | 0.5085 | 0.5316 | 0.8820 |
| Recall | 3 | 0.6663 | 0.4584 | 0.6958 | 0.6100 | 0.7773 |
| F1 | 3 | 0.6663 | 0.5890 | 0.5876 | 0.5681 | 0.8264 |
| Prevalence | 3 | 0.6663 | 0.1625 | 0.2158 | 0.2491 | 0.3726 |
| Detection Rate | 3 | 0.6663 | 0.0745 | 0.1502 | 0.1520 | 0.2896 |
| Detection Prevalence | 3 | 0.6663 | 0.0904 | 0.2953 | 0.2858 | 0.3284 |
| Balanced Accuracy | 3 | 0.6663 | 0.7197 | 0.7554 | 0.7159 | 0.8578 |
| Sensitivity | 4 | 0.6899 | 0.6900 | 0.5711 | 0.5168 | 0.8744 |
| Specificity | 4 | 0.6899 | 0.9443 | 0.8677 | 0.8723 | 0.8984 |
| Pos Pred Value | 4 | 0.6899 | 0.7060 | 0.5430 | 0.5731 | 0.8363 |
| Neg Pred Value | 4 | 0.6899 | 0.9401 | 0.8803 | 0.8448 | 0.9234 |
| Precision | 4 | 0.6899 | 0.7060 | 0.5430 | 0.5731 | 0.8363 |
| Recall | 4 | 0.6899 | 0.6900 | 0.5711 | 0.5168 | 0.8744 |
| F1 | 4 | 0.6899 | 0.6979 | 0.5567 | 0.5435 | 0.8550 |
| Prevalence | 4 | 0.6899 | 0.1625 | 0.2158 | 0.2491 | 0.3726 |
| Detection Rate | 4 | 0.6899 | 0.1121 | 0.1233 | 0.1287 | 0.3258 |
| Detection Prevalence | 4 | 0.6899 | 0.1588 | 0.2270 | 0.2247 | 0.3896 |
| Balanced Accuracy | 4 | 0.6899 | 0.8172 | 0.7194 | 0.6945 | 0.8864 |
| Sensitivity | 5 | 0.6899 | 0.6900 | 0.5711 | 0.5168 | 0.8744 |
| Specificity | 5 | 0.6899 | 0.9443 | 0.8677 | 0.8723 | 0.8984 |
| Pos Pred Value | 5 | 0.6899 | 0.7060 | 0.5430 | 0.5731 | 0.8363 |
| Neg Pred Value | 5 | 0.6899 | 0.9401 | 0.8803 | 0.8448 | 0.9234 |
| Precision | 5 | 0.6899 | 0.7060 | 0.5430 | 0.5731 | 0.8363 |
| Recall | 5 | 0.6899 | 0.6900 | 0.5711 | 0.5168 | 0.8744 |
| F1 | 5 | 0.6899 | 0.6979 | 0.5567 | 0.5435 | 0.8550 |
| Prevalence | 5 | 0.6899 | 0.1625 | 0.2158 | 0.2491 | 0.3726 |
| Detection Rate | 5 | 0.6899 | 0.1121 | 0.1233 | 0.1287 | 0.3258 |
| Detection Prevalence | 5 | 0.6899 | 0.1588 | 0.2270 | 0.2247 | 0.3896 |
| Balanced Accuracy | 5 | 0.6899 | 0.8172 | 0.7194 | 0.6945 | 0.8864 |
| Sensitivity | 6 | 0.6899 | 0.6900 | 0.5711 | 0.5168 | 0.8744 |
| Specificity | 6 | 0.6899 | 0.9443 | 0.8677 | 0.8723 | 0.8984 |
| Pos Pred Value | 6 | 0.6899 | 0.7060 | 0.5430 | 0.5731 | 0.8363 |
| Neg Pred Value | 6 | 0.6899 | 0.9401 | 0.8803 | 0.8448 | 0.9234 |
| Precision | 6 | 0.6899 | 0.7060 | 0.5430 | 0.5731 | 0.8363 |
| Recall | 6 | 0.6899 | 0.6900 | 0.5711 | 0.5168 | 0.8744 |
| F1 | 6 | 0.6899 | 0.6979 | 0.5567 | 0.5435 | 0.8550 |
| Prevalence | 6 | 0.6899 | 0.1625 | 0.2158 | 0.2491 | 0.3726 |
| Detection Rate | 6 | 0.6899 | 0.1121 | 0.1233 | 0.1287 | 0.3258 |
| Detection Prevalence | 6 | 0.6899 | 0.1588 | 0.2270 | 0.2247 | 0.3896 |
| Balanced Accuracy | 6 | 0.6899 | 0.8172 | 0.7194 | 0.6945 | 0.8864 |
From the summary, accuracy rises from 61.8% at depth 2 to 69.0% at depth 4, after which deeper trees (depths 5 and 6) produce identical results. Instead of classifying into four levels of risk, splitting the census tracts into low versus high risk could therefore increase accuracy. The final model provided to emergency responders or government officials would then give a high-level overview of which variables and values cause a census tract to be classified as high or low risk.
While the data is still imbalanced, the model may achieve better accuracy on this coarser distinction than on the small differences in values between adjacent classes.
# Create binary classification (low vs high) based on the provided breaks
trim_svi$risk_binary <- cut(trim_svi$RPL_THEMES, breaks = c(0, 0.5, 1), labels = c("low", "high"))
ggplot(trim_svi, aes(x = risk_binary, fill = risk_binary)) +
geom_bar() +
labs(title = "Bar Plot of SVI Risk (Binary)", x = "SVI Risk", y = "Count") +
scale_fill_manual(values = c("low" = "#1a9641", "high" = "#d7191c")) +
scale_x_discrete(labels = c("low" = "Low (0-0.50)", "high" = "High (0.50-1)")) +
theme_minimal() +
theme(legend.position = "none", text = element_text(family = "Avenir"))
# Decision Tree Modeling
set.seed(1)
svi_econ_tree <- rpart(risk_binary ~ income_scaled + EPL_POV150 + EPL_UNINSUR + EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = trim_svi, method = "class", control = list(maxdepth = 4))
printcp(svi_econ_tree) # display the results
##
## Classification tree:
## rpart(formula = risk_binary ~ income_scaled + EPL_POV150 + EPL_UNINSUR +
## EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG +
## EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = trim_svi,
## method = "class", control = list(maxdepth = 4))
##
## Variables actually used in tree construction:
## [1] EPL_NOHSDP EPL_POV150 income_scaled
##
## Root node error: 3388/8956 = 0.37829
##
## n= 8956
##
## CP nsplit rel error xerror xstd
## 1 0.598288 0 1.00000 1.00000 0.0135463
## 2 0.033353 1 0.40171 0.41765 0.0101881
## 3 0.028335 3 0.33501 0.37131 0.0097057
## 4 0.022432 4 0.30667 0.34002 0.0093516
## 5 0.010000 5 0.28424 0.30579 0.0089339
Below is the confusion matrix for the binary model.
# Confusion Matrix
train_predictions <- predict(svi_econ_tree, trim_svi, type = "class")
conf_matrix_train <- table(train_predictions, trim_svi$risk_binary)
# Display the confusion matrix
xkabledply(conf_matrix_train, "confusion matrix")
| | low | high |
|---|---|---|
| low | 2920 | 495 |
| high | 468 | 5073 |
# Convert to confusion matrix object
conf_matrix <- as.table(conf_matrix_train)
From the confusion matrix, the overall accuracy, precision for each class, and recall for each class are shown below.
# True positives, false positives, false negatives, true negatives for "high"
# (rows of conf_matrix_train are predictions, columns are actual classes)
tp <- conf_matrix_train["high", "high"]
fp <- conf_matrix_train["high", "low"] # predicted high, actually low
fn <- conf_matrix_train["low", "high"] # predicted low, actually high
tn <- conf_matrix_train["low", "low"]
# Accuracy
accuracy <- (tp + tn) / sum(conf_matrix_train)
cat("Overall Accuracy:", accuracy, "\n")
# Precision for "high"
precision_high <- tp / (tp + fp)
cat("Precision for 'high':", precision_high, "\n")
# Recall for "high"
recall_high <- tp / (tp + fn)
cat("Recall for 'high':", recall_high, "\n")
# Precision for "low"
precision_low <- tn / (tn + fn)
cat("Precision for 'low':", precision_low, "\n")
# Recall for "low"
recall_low <- tn / (tn + fp)
cat("Recall for 'low':", recall_low, "\n")
From this analysis, the binary classification model reaches an overall accuracy of 89.24%, a clear improvement over the 68.9% of the previous four-level model. The individual precision and recall values for this model are below:
Precision for 'high': 0.9155
Recall for 'high': 0.9111
Precision for 'low': 0.8551
Recall for 'low': 0.8619
These individual measures are higher across the board than in the four-level classification. The binary model predicts census tract risk more accurately and produces fewer false positives and false negatives. It remains better at classifying high risk than low risk, likely still a consequence of the imbalanced data.
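Since the imbalance keeps favoring the high class, one common mitigation (a sketch only, not part of the original analysis) is to hand `rpart` equal class priors via its `parms` argument, so the split criterion stops rewarding the over-represented class. Illustrated on synthetic data shaped like the binary `trim_svi` problem:

```r
library(rpart)
set.seed(1)
# synthetic stand-in for trim_svi: one predictor, imbalanced binary outcome
toy <- data.frame(
  x = c(rnorm(300, mean = 0), rnorm(100, mean = 2)),
  risk_binary = factor(c(rep("low", 300), rep("high", 100)),
                       levels = c("low", "high"))
)
# default fit vs. a fit that weights both classes equally via priors
fit_default <- rpart(risk_binary ~ x, data = toy, method = "class")
fit_equal   <- rpart(risk_binary ~ x, data = toy, method = "class",
                     parms = list(prior = c(0.5, 0.5)))
# equal priors typically shift the decision boundary toward the minority class
table(predicted = predict(fit_equal, toy, type = "class"),
      actual = toy$risk_binary)
```

On the real data, `risk_binary ~ income_scaled + EPL_POV150 + ...` would take the place of the toy formula.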
Using this binary classification, a classification tree can be provided to identify low versus high risk tracts, with clearly identifiable variables and values that determine the classification decisions. This deliverable, or an example of one, can be seen below.
set.seed(1)
svi_econ_tree <- rpart(risk_binary ~ income_scaled + EPL_POV150 + EPL_UNINSUR + EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP , data=trim_svi, method="class", control = list(maxdepth = 4) )
fancyRpartPlot(svi_econ_tree)
Conclusion: From this decision tree, it can be suggested that the variables planners and officials could examine to determine high versus low risk are EPL_POV150 (share of the population below 150% of the poverty line), EPL_NOHSDP (no high school diploma), and income_scaled (median income). This is an interesting finding that provides further insight into which variables are significant or meaningful in modeling the SVI risk of census tracts, building off the correlations identified in project 1 and contributing to the other findings and models analyzed in this project. It can also be concluded that the full model is the more representative model and does a better job of classifying census tract risk. Additionally, the model classifies high versus low risk better than the four-level risk classification. This suggests that to examine finer-scale risk, it would be necessary to introduce more data, through bootstrapping and resampling, or to fine-tune the models' parameters to create a more accurate model.
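The parameter fine-tuning mentioned above usually starts with cost-complexity pruning: pick the `cp` value minimizing the cross-validated error (`xerror`) from the tree's `cptable` and prune back to it. A minimal sketch, using the built-in `iris` data as a stand-in (in the report, `svi_econ_tree` would take the place of `fit`):

```r
library(rpart)
set.seed(1)
# stand-in model; the report would reuse svi_econ_tree here
fit <- rpart(Species ~ ., data = iris, method = "class")
cp_tab <- fit$cptable
# choose the complexity parameter with the lowest cross-validated error
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
# the pruned tree never has more nodes than the original
nrow(pruned$frame) <= nrow(fit$frame)
```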
Data cleaning is an essential step in data analysis to ensure accuracy and reliability. The given R code snippets demonstrate key data cleaning procedures such as column subsetting and renaming, missing values removal, formatting, and row subsetting.
Column Subsetting and Renaming
This process involves selecting specific columns from a dataset and renaming them for clarity and consistency. In the provided code, the subset function selects relevant columns from the SVI_Data dataset, creating a new dataframe SVI. The 'STCNTY' column is then renamed to 'county_fips', and the 'confirmed' column of us_cases is renamed to 'confirmed_cases'.
SVI <- subset(SVI_Data, select = c(STATE,STCNTY,COUNTY,FIPS,EP_POV150,EP_LIMENG,EP_AGE65, EP_MINRTY, EP_UNEMP, EP_NOHSDP,EP_AGE17,EP_DISABL,EP_SNGPNT,EP_MUNIT,EP_MOBILE,EP_CROWD,EP_NOVEH,EP_GROUPQ,RPL_THEMES) )
# Renaming columns
colnames(SVI)[colnames(SVI) == 'STCNTY'] <- 'county_fips'
us_cases <- rename(us_cases, confirmed_cases = confirmed)
Missing values removal
Dealing with missing or invalid data is crucial. The code filters out rows in the SVI dataframe that contain the placeholder value -999, removing erroneous entries. Additionally, the na.omit function removes rows with any missing values (NA). A similar check is applied to the other datasets (us_cases and us_deaths) to identify rows with missing data in specific columns.
# Filter out rows with any value equal to -999
SVI <- SVI[rowSums(SVI == -999, na.rm = TRUE) == 0, ]
# Drop rows with missing values
SVI <- na.omit(SVI)
us_cases[is.na(us_cases$confirmed_cases), ]
## [1] X county_fips county_name state_name
## [5] state_fips date confirmed_cases lat
## [9] long geometry
## <0 rows> (or 0-length row.names)
us_deaths[is.na(us_deaths$deaths), ]
## [1] X county_fips county_name state_name state_fips date
## [7] deaths lat long geometry
## <0 rows> (or 0-length row.names)
Formatting
Data often requires conversion into a usable format. In this step, the ‘date’ columns in us_cases and us_deaths datasets are converted to POSIXct datetime objects using the as.POSIXct function. This conversion, dictated by a predefined format string (format_str), is essential for time-series analysis or operations that require date-time formatted data.
format_str <- '%Y-%m-%d'
# Convert the 'date' column to datetime and create a new 'datetime' column
us_cases$datetime <- as.POSIXct(us_cases$date, format = format_str)
us_deaths$datetime <- as.POSIXct(us_deaths$date, format = format_str)
Row Subsetting
This step involves selecting specific rows based on certain conditions. For the SVI dataset, rows pertaining to the state of California are extracted into CA_SVI. In us_cases and us_deaths, rows labeled as ‘Statewide Unallocated’ are excluded to focus on specific counties. Additionally, subsets of the us_cases and us_deaths datasets are created for the state of California (‘CA’), allowing for a more focused analysis on this region.
CA_SVI <- subset(SVI, STATE == "California")
# Drop rows that do not refer to a specific county ('Statewide Unallocated')
us_cases <- us_cases[us_cases$county_name != 'Statewide Unallocated', ]
us_deaths <- us_deaths[us_deaths$county_name != 'Statewide Unallocated', ]
us_cases_ca <- subset(us_cases, state_name == "CA")
us_deaths_ca <- subset(us_deaths, state_name == "CA")
The EDA techniques below focus on creating maps and a timeline for COVID-19 in California.
Look at the Data
Before we proceed to the EDA, let's look at our data.
The COVID-19 confirmed cases dataset:
xkabledplyhead(us_cases_ca)
| | X | county_fips | county_name | state_name | state_fips | date | confirmed_cases | lat | long | geometry | datetime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 36097 | 36096 | 6000 | Grand Princess Cruise Ship | CA | 6 | 2020-01-22 | 0 | NA | NA | 2020-01-22 | |
| 36098 | 36097 | 6000 | Grand Princess Cruise Ship | CA | 6 | 2020-01-23 | 0 | NA | NA | 2020-01-23 | |
| 36099 | 36098 | 6000 | Grand Princess Cruise Ship | CA | 6 | 2020-01-24 | 0 | NA | NA | 2020-01-24 | |
| 36100 | 36099 | 6000 | Grand Princess Cruise Ship | CA | 6 | 2020-01-25 | 0 | NA | NA | 2020-01-25 | |
| 36101 | 36100 | 6000 | Grand Princess Cruise Ship | CA | 6 | 2020-01-26 | 0 | NA | NA | 2020-01-26 |
The COVID-19 deaths dataset:
xkabledplyhead(us_deaths_ca)
| | X | county_fips | county_name | state_name | state_fips | date | deaths | lat | long | geometry | datetime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 36097 | 36096 | 6000 | Grand Princess Cruise Ship | CA | 6 | 2020-01-22 | 0 | NA | NA | 2020-01-22 | |
| 36098 | 36097 | 6000 | Grand Princess Cruise Ship | CA | 6 | 2020-01-23 | 0 | NA | NA | 2020-01-23 | |
| 36099 | 36098 | 6000 | Grand Princess Cruise Ship | CA | 6 | 2020-01-24 | 0 | NA | NA | 2020-01-24 | |
| 36100 | 36099 | 6000 | Grand Princess Cruise Ship | CA | 6 | 2020-01-25 | 0 | NA | NA | 2020-01-25 | |
| 36101 | 36100 | 6000 | Grand Princess Cruise Ship | CA | 6 | 2020-01-26 | 0 | NA | NA | 2020-01-26 |
Data Aggregation and Summarization:
We start with aggregating and summarizing confirmed cases and deaths by county. The data is then grouped by county_fips and county_name within the state of California (us_cases_ca and us_deaths_ca). Summations of confirmed cases and deaths are calculated using the summarise function. The arrange function sorts these summaries in descending order, highlighting areas with the highest impact. This step is crucial for understanding the distribution and impact of COVID-19 at the county level.
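The aggregation chunk itself is not echoed in the report; a sketch of what it presumably looks like with dplyr (column names follow the datasets shown above, and the `confirmed` output name matches the `values` argument the map code passes to `plot_usmap`). A toy stand-in for `us_cases_ca` keeps the sketch self-contained:

```r
library(dplyr)
# toy stand-in for us_cases_ca (hypothetical rows, real column names)
us_cases_ca <- data.frame(
  county_fips = c(6037, 6037, 6073),
  county_name = c("Los Angeles", "Los Angeles", "San Diego"),
  confirmed_cases = c(10, 20, 5)
)
# group by county, sum confirmed cases, sort hardest-hit counties first
Summarized_cases <- us_cases_ca %>%
  group_by(county_fips, county_name) %>%
  summarise(confirmed = sum(confirmed_cases), .groups = "drop") %>%
  arrange(desc(confirmed))
# Los Angeles (30 cases in the toy data) now sorts first
```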
Map of CA colored by confirmed cases
For confirmed cases, a choropleth map of California (plot_usmap) is generated, where each county is colored based on the number of confirmed COVID-19 cases (Summarized_cases). The map is enhanced with a color gradient (white to dark blue), making it easy to identify areas with higher numbers of cases.
# Plot the choropleth map
plot_usmap(
data = Summarized_cases,
values = "confirmed",
include = c("CA"),
labels = TRUE
) +
scale_fill_gradient(
low = "white", high = "darkblue", name = "Confirmed Cases", label = scales::comma) +
labs(title = "COVID-19 Confirmed Cases in California Counties") +
theme(legend.position = "right")
Observation:
The darkest shade of blue, indicating the highest number of confirmed cases, is concentrated in one county (Los Angeles). This suggests a significant outbreak or a larger population in this county, which could be one of the more urban areas with higher population density.
The gradient scale on the right indicates that the highest number of confirmed cases in a single county goes up to 6,000,000, which is a substantial figure. This suggests a high transmission rate or a prolonged presence of the virus in the most affected areas.
Most counties appear to have a much lower number of cases, as indicated by the light blue or white color, showing a significant disparity in case distribution across the state.
Map of CA colored by deaths
For number of deaths, a choropleth map of California (plot_usmap) is generated, where each county is colored based on the number of COVID-19 deaths (Summarized_deaths). The map is enhanced with a color gradient (white to dark blue), making it easy to identify areas with higher numbers of deaths.
# Plot the choropleth map
plot_usmap(
data = Summarized_deaths,
values = "deaths",
include = c("CA"),
labels = TRUE
) +
scale_fill_gradient(
low = "white", high = "darkblue", name = "Number of Deaths", label = scales::comma) +
labs(title = "COVID-19 Deaths in California Counties") +
theme(legend.position = "right")
Observation:
The second map represents the number of deaths attributed to COVID-19. Again, the same county (LA) that had the highest number of confirmed cases also has the highest number of deaths, which is consistent with expectations.
The scale for deaths is significantly lower than that for confirmed cases, topping out at 250,000 deaths in the most affected county. This disparity between the number of cases and deaths may be due to various factors, including the healthcare system’s capacity, the demographic profile of the affected population, and the measures taken to treat and prevent the spread of the virus.
Similar to the confirmed cases, the rest of the counties show fewer deaths, indicating a lesser impact or effective management and response to the pandemic in these regions.
Here we perform data aggregation and summarization tasks on two separate datasets: one containing COVID-19 data (confirmed cases) and the other containing vulnerability indicators (referred to as SVI).
For the COVID-19 data, the code groups the data by county FIPS code, county name, and state name to organize the cases geographically. It then calculates the total number of confirmed COVID-19 cases for each county and sorts these totals in descending order, so that counties with the most cases and deaths appear first.
For the SVI data, the code similarly groups the data by geographical identifiers but goes on to calculate the average for a range of vulnerability indicators within each county. These indicators include measures of poverty, linguistic isolation, the proportion of the elderly population, minority status, unemployment rates, education levels, and others that together create a profile of each county’s social vulnerability.
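The SVI averaging step described above is likewise not echoed; a hedged sketch with `dplyr::across()`, shown on a two-row toy frame standing in for the tract-level `SVI` data:

```r
library(dplyr)
# toy stand-in: two census tracts in one county (hypothetical values,
# real column names)
SVI <- data.frame(
  county_fips = c(6037, 6037),
  COUNTY      = "Los Angeles",
  STATE       = "California",
  EP_POV150   = c(20, 30),
  RPL_THEMES  = c(0.6, 0.8)
)
# average every EP_* indicator (plus the overall score) within each county
svi_by_county <- SVI %>%
  group_by(county_fips, COUNTY, STATE) %>%
  summarise(across(c(starts_with("EP_"), RPL_THEMES), mean), .groups = "drop")
# one row per county: EP_POV150 = 25, RPL_THEMES = 0.7
```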
head(svi_all_counties_cases)
## county_fips COUNTY STATE EP_POV150 EP_LIMENG EP_AGE65 EP_MINRTY
## 1 6037 Los Angeles California 23.87737 12.828138 13.89510 72.85186
## 2 6065 Riverside California 23.09419 8.239341 16.69419 63.61318
## 3 6059 Orange California 16.28873 8.660458 15.87712 57.10507
## 4 6073 San Diego California 18.12154 6.174760 14.85514 53.03333
## 5 6071 San Bernardino California 25.56299 7.435714 12.62857 69.68442
## 6 6001 Alameda California 15.24695 7.488064 14.21671 67.70398
## EP_UNEMP EP_NOHSDP EP_AGE17 EP_DISABL EP_SNGPNT EP_MUNIT EP_MOBILE EP_CROWD
## 1 6.473968 20.79826 21.18955 10.177733 6.566640 24.788340 1.696275 12.440850
## 2 7.513178 18.21647 23.74128 12.406589 6.296705 7.409884 8.743023 7.675194
## 3 4.943954 13.27876 21.45654 9.089052 5.101307 17.061765 2.833497 9.299020
## 4 6.215501 11.81687 20.85789 10.109602 5.670508 20.062963 3.522908 7.106996
## 5 7.477273 19.43745 26.06212 11.952597 8.187013 9.546537 5.960606 9.195238
## 6 4.830769 11.06711 19.63873 9.510345 5.194430 22.045093 1.136870 8.306101
## EP_NOVEH EP_GROUPQ RPL_THEMES confirmed_cases
## 1 8.950891 1.643563 0.6599071 7627646
## 2 4.189535 1.295349 0.6282279 1314109
## 3 4.304248 1.171405 0.5076575 1139456
## 4 5.362414 1.894650 0.5263907 1112369
## 5 4.872727 1.179654 0.6663433 942421
## 6 9.815385 2.041379 0.5008411 466773
The table above shows the counties with the highest COVID-19 case totals and their mean SVI indicator values.
Create timeline of covid confirmed cases
The aggregated data created above is used to visualize the timeline of COVID-19 cases across CA counties.
# Which are the 10 most affected CA counties?
ca_cases_top_10 <- head(us_cases_all_counties, 10)
# Preparing a list with the 10 most affected counties:
top_10_cases_fips_list <- ca_cases_top_10$county_fips
ca_cases_top_10_datetime <- subset(us_cases_ca, county_fips %in% top_10_cases_fips_list)
# Plot Timeline of cases
ca_cases_top_10_datetime$datetime <- as.Date(ca_cases_top_10_datetime$datetime) # Make sure datetime is in Date format
f <- ggplot(ca_cases_top_10_datetime, aes(x = datetime, y = confirmed_cases, color = county_name)) +
geom_line() +
labs(x = 'Timeline', y = 'Confirmed Cases', title = 'Confirmed cases in CA counties') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_color_discrete(name = 'County Name')+
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") # Format the x-axis with month and year
print(f)
Observation
Los Angeles County stands out with a pronounced exponential growth curve, indicating a rapid increase in confirmed cases, surpassing 150,000 by August 2020. Other counties such as San Diego, Orange, and Riverside also show a rise in cases, but none as dramatic as Los Angeles County. This graph emphasizes the wide variation in case numbers between counties, with some experiencing relatively moderate increases while one, in particular, sees a far more aggressive spread of the virus.
Here we repeat the aggregation and summarization, this time pairing the vulnerability indicators (SVI) with COVID-19 deaths rather than confirmed cases.
For the deaths data, the code groups the records by county FIPS code, county name, and state name, calculates the total number of deaths for each county, and sorts these totals in descending order so that the hardest-hit counties appear first.
For the SVI data, the code again groups by the same geographical identifiers and calculates the average of each vulnerability indicator within each county, profiling poverty, linguistic isolation, the proportion of elderly residents, minority status, unemployment, education levels, and other measures of social vulnerability.
head(svi_all_counties_deaths)
## county_fips COUNTY STATE EP_POV150 EP_LIMENG EP_AGE65 EP_MINRTY
## 1 6037 Los Angeles California 23.87737 12.828138 13.89510 72.85186
## 2 6065 Riverside California 23.09419 8.239341 16.69419 63.61318
## 3 6073 San Diego California 18.12154 6.174760 14.85514 53.03333
## 4 6059 Orange California 16.28873 8.660458 15.87712 57.10507
## 5 6071 San Bernardino California 25.56299 7.435714 12.62857 69.68442
## 6 6085 Santa Clara California 11.68260 8.835294 13.83750 68.57598
## EP_UNEMP EP_NOHSDP EP_AGE17 EP_DISABL EP_SNGPNT EP_MUNIT EP_MOBILE EP_CROWD
## 1 6.473968 20.79826 21.18955 10.177733 6.566640 24.788340 1.696275 12.440850
## 2 7.513178 18.21647 23.74128 12.406589 6.296705 7.409884 8.743023 7.675194
## 3 6.215501 11.81687 20.85789 10.109602 5.670508 20.062963 3.522908 7.106996
## 4 4.943954 13.27876 21.45654 9.089052 5.101307 17.061765 2.833497 9.299020
## 5 7.477273 19.43745 26.06212 11.952597 8.187013 9.546537 5.960606 9.195238
## 6 4.212745 11.63505 21.92353 8.268382 4.490441 19.685049 2.934559 8.681127
## EP_NOVEH EP_GROUPQ RPL_THEMES deaths
## 1 8.950891 1.643563 0.6599071 262786
## 2 4.189535 1.295349 0.6282279 36583
## 3 5.362414 1.894650 0.5263907 30093
## 4 4.304248 1.171405 0.5076575 22931
## 5 4.872727 1.179654 0.6663433 21268
## 6 5.313235 1.862500 0.4473218 15714
The table above shows the counties with the highest COVID-19 death totals and their mean SVI indicator values.
Create timeline of COVID-19 deaths
The aggregated data created above is used to visualize the timeline of COVID-19 deaths across CA counties.
# Which are the 10 most affected CA counties?
ca_deaths_top_10 <- head(us_deaths_all_counties, 10)
# Preparing a list with the 10 most affected counties:
top_10_deaths_fips_list <- ca_deaths_top_10$county_fips
ca_deaths_top_10_datetime <- subset(us_deaths_ca, county_fips %in% top_10_deaths_fips_list)
# Plot Timeline of deaths
ca_deaths_top_10_datetime$datetime <- as.Date(ca_deaths_top_10_datetime$datetime) # Make sure datetime is in Date format
f <- ggplot(ca_deaths_top_10_datetime, aes(x = datetime, y = deaths, color = county_name)) +
geom_line() +
labs(x = 'Timeline', y = 'Deaths', title = 'COVID-19 Deaths in CA Counties') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_color_discrete(name = 'County Name')+
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") # Format the x-axis with month and year
print(f)
Observation
The graph shows a timeline of COVID-19 cases, where the number of deaths in Los Angeles County appears to exceed 4,000. In contrast, other counties represented on the graph, such as Alameda, Orange, Riverside, and others, show a much flatter curve with far fewer deaths over the same period. This visualization starkly illustrates the disparate impact of the pandemic across different regions within the state, with Los Angeles County being the most severely affected in terms of mortality. Both graphs provide a clear visual comparison of the pandemic’s impact at the county level and underline the importance of localized data in understanding and responding to the spread of COVID-19.
Correlation matrix between confirmed cases and SVI variables
# Plot the correlation matrix
# Select relevant columns
svi_all_counties_cases_corr <- svi_all_counties_cases[c('confirmed_cases', 'EP_POV150', 'EP_LIMENG', 'EP_AGE65', 'EP_MINRTY', 'EP_UNEMP', 'EP_NOHSDP', 'EP_AGE17', 'EP_DISABL', 'EP_SNGPNT', 'EP_MUNIT', 'EP_MOBILE', 'EP_CROWD', 'EP_NOVEH', 'EP_GROUPQ', 'RPL_THEMES')]
# Calculate correlation matrix
corr <- cor(svi_all_counties_cases_corr)
corrplot(corr, is.corr = TRUE, method = 'color', na.label='NA',col = colorRampPalette(c("white", "darkblue"))(100), addCoef.col = 'black', number.cex = 0.5,mar = c(0, 0, 2, 0), tl.cex = 0.5, tl.col = 'black', tl.srt = 45)
Observation:
The matrix correlates various social vulnerability indices (like poverty, linguistic isolation, age, minority status, unemployment, etc.) with confirmed COVID-19 cases.
There’s a strong positive correlation between confirmed cases and EP_POV150 (representing poverty), EP_MINRTY (minority status), and EP_NOHSDP (individuals without a high school diploma). These strong positive correlations suggest that counties with higher poverty rates, larger minority populations, and more individuals without a high school diploma tend to have more confirmed COVID-19 cases.
EP_LIMENG (limited English proficiency) and EP_AGE65 (seniors over 65) have strong negative correlations with confirmed cases, which might indicate that areas with more non-English speakers or senior citizens have fewer confirmed cases. However, this could also be due to underreporting or less testing in these communities.
The correlation with RPL_THEMES is also moderately strong, suggesting that higher overall vulnerability is associated with more confirmed cases.
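The "strong" and "moderate" judgments above come from reading coefficients off the corrplot; they could be put on firmer footing with `cor.test()`, which adds a p-value and confidence interval for any indicator pair. A sketch on synthetic vectors (the county-level frame is assumed from earlier chunks; the values here are illustrative only):

```r
set.seed(1)
n_counties <- 58                                  # California has 58 counties
poverty <- runif(n_counties, 5, 30)               # hypothetical EP_POV150 values
cases   <- 1000 * poverty + rnorm(n_counties, sd = 2000)  # hypothetical case counts
ct <- cor.test(cases, poverty)                    # Pearson correlation by default
c(estimate = unname(ct$estimate), p_value = ct$p.value)
```

On the real data, `cor.test(svi_all_counties_cases$confirmed_cases, svi_all_counties_cases$EP_POV150)` would test the poverty relationship directly.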
Correlation matrix between deaths and SVI variables
# Plot the correlation matrix
# Select relevant columns
svi_all_counties_deaths_corr <- svi_all_counties_deaths[c('deaths', 'EP_POV150', 'EP_LIMENG', 'EP_AGE65', 'EP_MINRTY', 'EP_UNEMP', 'EP_NOHSDP', 'EP_AGE17', 'EP_DISABL', 'EP_SNGPNT', 'EP_MUNIT', 'EP_MOBILE', 'EP_CROWD', 'EP_NOVEH', 'EP_GROUPQ', 'RPL_THEMES')]
# Calculate correlation matrix
corr <- cor(svi_all_counties_deaths_corr)
corrplot(corr, is.corr = TRUE, method = 'color', na.label='NA',col = colorRampPalette(c("white", "darkblue"))(100), addCoef.col = 'black', number.cex = 0.5,mar = c(0, 0, 2, 0), tl.cex = 0.5, tl.col = 'black', tl.srt = 45)
Observation:
This matrix analyzes the same set of variables but correlates them with COVID-19 deaths instead of cases.
EP_POV150, EP_MINRTY, and EP_NOHSDP again show moderate to strong positive correlations with COVID-19 deaths, indicating that higher poverty, higher minority populations, and lower education levels are associated with higher death counts.
Similar to the confirmed cases, EP_LIMENG and EP_AGE65 show negative correlations with deaths, which could suggest lower mortality in communities with these characteristics, although the reasons for this would require further investigation.
Correlation matrix of SVI variables vs. confirmed cases for the most affected counties
# Select the rows in the SVI dataframe that contain the ten counties with most confirmed cases.
svi_top_10_counties_cases <- SVI[SVI$county_fips %in% top_10_cases_fips_list, ]
# Group by county, COUNTY, and STATE, and calculate the mean values for selected indicators.
svi_top_10_counties_cases <- svi_top_10_counties_cases %>%
group_by(county_fips, COUNTY, STATE) %>%
summarise(
EP_POV150 = mean(EP_POV150),
EP_LIMENG = mean(EP_LIMENG),
EP_AGE65 = mean(EP_AGE65),
EP_MINRTY = mean(EP_MINRTY),
EP_UNEMP = mean(EP_UNEMP),
EP_NOHSDP = mean(EP_NOHSDP),
EP_AGE17 = mean(EP_AGE17),
EP_DISABL = mean(EP_DISABL),
EP_SNGPNT = mean(EP_SNGPNT),
EP_MUNIT = mean(EP_MUNIT),
EP_MOBILE = mean(EP_MOBILE),
EP_CROWD = mean(EP_CROWD),
EP_NOVEH = mean(EP_NOVEH),
EP_GROUPQ = mean(EP_GROUPQ),
RPL_THEMES = mean(RPL_THEMES)
) %>%
ungroup()
## `summarise()` has grouped output by 'county_fips', 'COUNTY'. You can override
## using the `.groups` argument.
# Merge county SVI with confirmed cases of Covid-19
svi_top_10_counties_cases <- merge(svi_top_10_counties_cases, ca_cases_top_10, by = 'county_fips') %>%
arrange(desc(confirmed_cases)) %>%
select(-county_name, -state_name)
# Select columns for correlation analysis
svi_top_10_counties_cases_corr <- svi_top_10_counties_cases[c(
'confirmed_cases', 'EP_POV150', 'EP_LIMENG', 'EP_AGE65', 'EP_MINRTY',
'EP_UNEMP', 'EP_NOHSDP', 'EP_AGE17', 'EP_DISABL', 'EP_SNGPNT',
'EP_MUNIT', 'EP_MOBILE', 'EP_CROWD', 'EP_NOVEH', 'EP_GROUPQ', 'RPL_THEMES'
)]
# Calculate correlation matrix
corr <- cor(svi_top_10_counties_cases_corr)
corrplot(corr, is.corr = TRUE, method = 'color', na.label='NA',col = colorRampPalette(c("white", "darkblue"))(100), addCoef.col = 'black', number.cex = 0.5,mar = c(0, 0, 2, 0), tl.cex = 0.5, tl.col = 'black', tl.srt = 45)
Observation:
There is a notable negative correlation between confirmed cases and EP_DISABL (presumably representing disabled individuals), EP_SNGPNT (single-parent households), and EP_MOBILE (mobile home residency). These negative values suggest that as the proportion of disabled individuals, single-parent households, or mobile home residencies increases, the number of confirmed cases tends to decrease, which could be due to various factors such as isolation or reduced social contact.
The variable EP_UNEMP (unemployment) has a strong positive correlation with EP_NOHSDP (no high school diploma) and EP_POV150 (poverty), which might indicate that socioeconomic factors are interconnected and potentially contribute to vulnerability to COVID-19.
EP_CROWD (crowded housing) shows a substantial positive correlation with confirmed cases, hinting that living conditions with limited space might contribute to the spread of the virus.
The variable RPL_THEMES, with moderate positive correlations across the board, might be an aggregate indicator of risk, showing that overall vulnerability is linked to the number of confirmed cases.
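Given the small sample of ten counties, many of these coefficients may not be distinguishable from zero. `corrplot::cor.mtest()` can compute pairwise p-values so that non-significant cells are flagged in the plot; a hedged sketch reusing the objects above:

```r
library(corrplot)

# Pairwise significance tests for the same correlation matrix;
# cells with p >= 0.05 are crossed out (insig = 'pch') in the plot.
p_res <- cor.mtest(svi_top_10_counties_cases_corr, conf.level = 0.95)
corrplot(corr, method = 'color', p.mat = p_res$p, sig.level = 0.05,
         insig = 'pch', tl.cex = 0.5, tl.col = 'black', tl.srt = 45)
```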
Create correlation matrix for SVI vs number of deaths for just the most affected counties
# Select the rows in the SVI dataframe that contain the ten counties with the most deaths.
svi_top_10_counties_deaths <- SVI[SVI$county_fips %in% top_10_deaths_fips_list, ]
# Group by county_fips, COUNTY, and STATE, and calculate the mean of each selected indicator.
svi_top_10_counties_deaths <- svi_top_10_counties_deaths %>%
group_by(county_fips, COUNTY, STATE) %>%
summarise(
EP_POV150 = mean(EP_POV150),
EP_LIMENG = mean(EP_LIMENG),
EP_AGE65 = mean(EP_AGE65),
EP_MINRTY = mean(EP_MINRTY),
EP_UNEMP = mean(EP_UNEMP),
EP_NOHSDP = mean(EP_NOHSDP),
EP_AGE17 = mean(EP_AGE17),
EP_DISABL = mean(EP_DISABL),
EP_SNGPNT = mean(EP_SNGPNT),
EP_MUNIT = mean(EP_MUNIT),
EP_MOBILE = mean(EP_MOBILE),
EP_CROWD = mean(EP_CROWD),
EP_NOVEH = mean(EP_NOVEH),
EP_GROUPQ = mean(EP_GROUPQ),
RPL_THEMES = mean(RPL_THEMES)
) %>%
ungroup()
## `summarise()` has grouped output by 'county_fips', 'COUNTY'. You can override
## using the `.groups` argument.
# Merge county SVI with COVID-19 deaths
svi_top_10_counties_deaths <- merge(svi_top_10_counties_deaths, ca_deaths_top_10, by = 'county_fips') %>%
arrange(desc(deaths)) %>%
select(-county_name, -state_name)
# Select columns for correlation analysis
svi_top_10_counties_deaths_corr <- svi_top_10_counties_deaths[c(
'deaths', 'EP_POV150', 'EP_LIMENG', 'EP_AGE65', 'EP_MINRTY',
'EP_UNEMP', 'EP_NOHSDP', 'EP_AGE17', 'EP_DISABL', 'EP_SNGPNT',
'EP_MUNIT', 'EP_MOBILE', 'EP_CROWD', 'EP_NOVEH', 'EP_GROUPQ', 'RPL_THEMES'
)]
# Calculate correlation matrix
corr <- cor(svi_top_10_counties_deaths_corr)
corrplot(corr, is.corr = TRUE, method = 'color', na.label = 'NA',
         col = colorRampPalette(c("white", "darkblue"))(100),
         addCoef.col = 'black', number.cex = 0.5, mar = c(0, 0, 2, 0),
         tl.cex = 0.5, tl.col = 'black', tl.srt = 45)
Observation:
EP_LIMENG (limited English proficiency) and EP_MINRTY (minority status) show moderate positive correlations with deaths, suggesting that these communities may experience higher mortality rates from COVID-19.
Interestingly, the negative correlations observed in the confirmed cases matrix are not as pronounced here. EP_DISABL, EP_SNGPNT, and EP_MOBILE do not exhibit strong negative correlations with deaths, which could imply that while these factors are associated with fewer cases, they do not necessarily predict a lower number of deaths.
The EP_CROWD variable also shows a positive correlation with deaths, which supports the idea that crowded living conditions may not only facilitate the spread of the virus but also contribute to higher mortality.
RPL_THEMES again shows a moderate positive correlation with deaths, consistent with the notion that overall social vulnerability is related to worse outcomes in the pandemic.
Conclusion from correlation matrix:
The matrices indicate that social vulnerability factors are correlated with both the spread of COVID-19 and its mortality impact in the most affected counties, although no formal significance tests were applied to these coefficients.
There are strong inter-correlations between several vulnerability indicators, suggesting that these factors are often co-occurring in the affected communities.
While both matrices show that higher vulnerability correlates with negative COVID-19 outcomes, the patterns are not identical for confirmed cases and deaths, indicating that different factors may play a role in infection rates versus mortality rates.
Correlation does not imply causation, so these relationships should be interpreted with caution.
Initial MLR for confirmed cases
library(caret)
# Preparing the independent variable (X) and the dependent variable (y)
y <- svi_all_counties_cases$RPL_THEMES
X <- svi_all_counties_cases[, c('confirmed_cases','EP_POV150','EP_LIMENG','EP_AGE65', 'EP_MINRTY', 'EP_UNEMP', 'EP_NOHSDP','EP_AGE17','EP_DISABL','EP_SNGPNT','EP_MUNIT','EP_MOBILE','EP_CROWD','EP_NOVEH','EP_GROUPQ')]
# Splitting the data into train and test data
set.seed(42) # For reproducibility
train_indices <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices]
y_test <- y[-train_indices]
# Initiate the model
lm_model1 <- lm(y_train ~ ., data = cbind(y_train, X_train))
# Predictions on the test set
y_predict <- predict(lm_model1, newdata = X_test)
summary(lm_model1)
##
## Call:
## lm(formula = y_train ~ ., data = cbind(y_train, X_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121482 -0.028111 -0.001598 0.027159 0.099589
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.728e-02 2.268e-01 -0.429 0.6709
## confirmed_cases -6.660e-09 1.070e-08 -0.622 0.5381
## EP_POV150 6.504e-03 3.441e-03 1.890 0.0678 .
## EP_LIMENG -8.797e-03 8.352e-03 -1.053 0.3001
## EP_AGE65 4.400e-03 5.197e-03 0.847 0.4035
## EP_MINRTY 1.050e-03 1.782e-03 0.589 0.5599
## EP_UNEMP 1.083e-02 1.006e-02 1.077 0.2897
## EP_NOHSDP 9.497e-03 5.626e-03 1.688 0.1011
## EP_AGE17 1.796e-03 7.887e-03 0.228 0.8213
## EP_DISABL 3.163e-03 6.040e-03 0.524 0.6041
## EP_SNGPNT 1.536e-02 8.527e-03 1.802 0.0810 .
## EP_MUNIT 7.900e-03 3.081e-03 2.564 0.0152 *
## EP_MOBILE 5.645e-03 3.180e-03 1.775 0.0854 .
## EP_CROWD 1.146e-04 8.008e-03 0.014 0.9887
## EP_NOVEH -4.033e-03 4.701e-03 -0.858 0.3973
## EP_GROUPQ -1.415e-03 3.753e-03 -0.377 0.7087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05882 on 32 degrees of freedom
## Multiple R-squared: 0.86, Adjusted R-squared: 0.7944
## F-statistic: 13.11 on 15 and 32 DF, p-value: 1.188e-09
# Score (R-squared)
print(summary(lm_model1)$r.squared)
## [1] 0.8600357
# Coefficients
confirmed_cases_all_counties_coef <- data.frame(Coef = coef(lm_model1)[-1], Features = names(coef(lm_model1)[-1]))
confirmed_cases_all_counties_coef <- confirmed_cases_all_counties_coef[order(-confirmed_cases_all_counties_coef$Coef), ]
The comprehensive model incorporates all variables under consideration, including confirmed COVID-19 cases. It demonstrates a robust multiple R-squared value of 0.86, indicating that approximately 86% of the variance in the Social Vulnerability Index (SVI) score, denoted RPL_THEMES, is accounted for by the model. With 15 predictors and only 48 training observations (32 residual degrees of freedom), however, the adjusted R-squared of 0.7944 is the more reliable gauge of fit.
Upon examining the individual contributions of each predictor, we observe a spectrum of significance levels. Notably, the variable EP_MUNIT, representing the prevalence of multi-unit housing, emerges as a statistically significant predictor at the 0.05 level, indicative of a substantial correlation with the SVI score. Additionally, variables such as EP_POV150 (poverty indicator), EP_SNGPNT (single-parent households), and EP_MOBILE (mobile home residency) exhibit potential significance with p-values close to 0.1, suggesting their probable but less definitive influence on the SVI score.
Conversely, the coefficient pertaining to confirmed COVID-19 cases does not reach a level of statistical significance. This absence of a clear linear association within the context of the model implies that while confirmed cases are an integral aspect of the public health landscape, they do not necessarily predict the SVI score directly when other variables are considered. This insight underscores the multifaceted nature of social vulnerability and the intricate interplay of various determinants.
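Because several EP_* indicators are strongly inter-correlated, the coefficient estimates above may be distorted by multicollinearity, which widens standard errors and can mask significance. One common check is variance inflation factors; a sketch assuming the `car` package is available:

```r
library(car)

# VIF values for each predictor in the full model; as a rule of thumb,
# values above 5-10 indicate a predictor largely explained by the others.
vif_values <- vif(lm_model1)
sort(vif_values, decreasing = TRUE)
```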
Feature Selection
A stepwise feature selection process based on the Akaike information criterion (AIC) is applied to refine the model. The goal is to identify a subset of variables that provides a parsimonious model without compromising explanatory power. The procedure sequentially removes (and may re-add) the variables that contribute least to the model; here, EP_CROWD is removed first. The final model is more streamlined and focuses on the predictors that contribute most to the variation in the SVI score.
# Stepwise feature selection
lm_model2 <- step(lm_model1, direction = "both")
## Start: AIC=-259.45
## y_train ~ confirmed_cases + EP_POV150 + EP_LIMENG + EP_AGE65 +
## EP_MINRTY + EP_UNEMP + EP_NOHSDP + EP_AGE17 + EP_DISABL +
## EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_CROWD + EP_NOVEH +
## EP_GROUPQ
##
## Df Sum of Sq RSS AIC
## - EP_CROWD 1 0.0000007 0.11072 -261.45
## - EP_AGE17 1 0.0001795 0.11090 -261.38
## - EP_GROUPQ 1 0.0004916 0.11121 -261.24
## - EP_DISABL 1 0.0009488 0.11167 -261.04
## - EP_MINRTY 1 0.0012011 0.11192 -260.94
## - confirmed_cases 1 0.0013401 0.11206 -260.88
## - EP_AGE65 1 0.0024803 0.11320 -260.39
## - EP_NOVEH 1 0.0025468 0.11327 -260.36
## - EP_LIMENG 1 0.0038386 0.11456 -259.82
## - EP_UNEMP 1 0.0040099 0.11473 -259.75
## <none> 0.11072 -259.45
## - EP_NOHSDP 1 0.0098593 0.12058 -257.36
## - EP_MOBILE 1 0.0109020 0.12162 -256.94
## - EP_SNGPNT 1 0.0112298 0.12195 -256.82
## - EP_POV150 1 0.0123617 0.12308 -256.37
## - EP_MUNIT 1 0.0227473 0.13347 -252.48
##
## Step: AIC=-261.45
## y_train ~ confirmed_cases + EP_POV150 + EP_LIMENG + EP_AGE65 +
## EP_MINRTY + EP_UNEMP + EP_NOHSDP + EP_AGE17 + EP_DISABL +
## EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_NOVEH + EP_GROUPQ
##
## Df Sum of Sq RSS AIC
## - EP_AGE17 1 0.0001788 0.11090 -263.38
## - EP_GROUPQ 1 0.0005004 0.11122 -263.24
## - EP_DISABL 1 0.0009537 0.11168 -263.04
## - EP_MINRTY 1 0.0013562 0.11208 -262.87
## - confirmed_cases 1 0.0014274 0.11215 -262.84
## - EP_AGE65 1 0.0024822 0.11321 -262.39
## - EP_NOVEH 1 0.0025720 0.11330 -262.35
## - EP_LIMENG 1 0.0039575 0.11468 -261.77
## - EP_UNEMP 1 0.0040167 0.11474 -261.74
## <none> 0.11072 -261.45
## + EP_CROWD 1 0.0000007 0.11072 -259.45
## - EP_NOHSDP 1 0.0098658 0.12059 -259.36
## - EP_MOBILE 1 0.0110205 0.12174 -258.90
## - EP_SNGPNT 1 0.0113076 0.12203 -258.79
## - EP_POV150 1 0.0130786 0.12380 -258.09
## - EP_MUNIT 1 0.0227910 0.13351 -254.47
##
## Step: AIC=-263.38
## y_train ~ confirmed_cases + EP_POV150 + EP_LIMENG + EP_AGE65 +
## EP_MINRTY + EP_UNEMP + EP_NOHSDP + EP_DISABL + EP_SNGPNT +
## EP_MUNIT + EP_MOBILE + EP_NOVEH + EP_GROUPQ
##
## Df Sum of Sq RSS AIC
## - EP_GROUPQ 1 0.0009348 0.11184 -264.97
## - EP_DISABL 1 0.0009998 0.11190 -264.94
## - EP_MINRTY 1 0.0013309 0.11223 -264.80
## - confirmed_cases 1 0.0021979 0.11310 -264.43
## - EP_AGE65 1 0.0028648 0.11377 -264.15
## <none> 0.11090 -263.38
## - EP_LIMENG 1 0.0051724 0.11607 -263.19
## - EP_UNEMP 1 0.0055187 0.11642 -263.04
## - EP_NOVEH 1 0.0055872 0.11649 -263.02
## + EP_AGE17 1 0.0001788 0.11072 -261.45
## + EP_CROWD 1 0.0000000 0.11090 -261.38
## - EP_SNGPNT 1 0.0111661 0.12207 -260.77
## - EP_MOBILE 1 0.0115235 0.12243 -260.63
## - EP_POV150 1 0.0142205 0.12512 -259.58
## - EP_NOHSDP 1 0.0167041 0.12761 -258.64
## - EP_MUNIT 1 0.0246999 0.13560 -255.72
##
## Step: AIC=-264.97
## y_train ~ confirmed_cases + EP_POV150 + EP_LIMENG + EP_AGE65 +
## EP_MINRTY + EP_UNEMP + EP_NOHSDP + EP_DISABL + EP_SNGPNT +
## EP_MUNIT + EP_MOBILE + EP_NOVEH
##
## Df Sum of Sq RSS AIC
## - EP_DISABL 1 0.0010079 0.11284 -266.54
## - EP_MINRTY 1 0.0017544 0.11359 -266.23
## - confirmed_cases 1 0.0020433 0.11388 -266.10
## - EP_AGE65 1 0.0030932 0.11493 -265.66
## <none> 0.11184 -264.97
## - EP_LIMENG 1 0.0048490 0.11668 -264.94
## - EP_UNEMP 1 0.0050493 0.11689 -264.85
## - EP_NOVEH 1 0.0055482 0.11738 -264.65
## + EP_GROUPQ 1 0.0009348 0.11090 -263.38
## + EP_AGE17 1 0.0006132 0.11122 -263.24
## + EP_CROWD 1 0.0000043 0.11183 -262.97
## - EP_MOBILE 1 0.0107699 0.12261 -262.56
## - EP_SNGPNT 1 0.0107766 0.12261 -262.56
## - EP_POV150 1 0.0155872 0.12742 -260.71
## - EP_NOHSDP 1 0.0158809 0.12772 -260.60
## - EP_MUNIT 1 0.0237719 0.13561 -257.72
##
## Step: AIC=-266.54
## y_train ~ confirmed_cases + EP_POV150 + EP_LIMENG + EP_AGE65 +
## EP_MINRTY + EP_UNEMP + EP_NOHSDP + EP_SNGPNT + EP_MUNIT +
## EP_MOBILE + EP_NOVEH
##
## Df Sum of Sq RSS AIC
## - confirmed_cases 1 0.0017218 0.11457 -267.81
## - EP_MINRTY 1 0.0018261 0.11467 -267.77
## - EP_NOVEH 1 0.0045410 0.11739 -266.65
## <none> 0.11284 -266.54
## - EP_AGE65 1 0.0053206 0.11817 -266.33
## - EP_UNEMP 1 0.0080345 0.12088 -265.24
## + EP_DISABL 1 0.0010079 0.11184 -264.97
## + EP_GROUPQ 1 0.0009429 0.11190 -264.94
## + EP_AGE17 1 0.0006913 0.11215 -264.84
## + EP_CROWD 1 0.0000018 0.11284 -264.54
## - EP_SNGPNT 1 0.0102151 0.12306 -264.38
## - EP_LIMENG 1 0.0102823 0.12313 -264.36
## - EP_MOBILE 1 0.0121526 0.12500 -263.63
## - EP_POV150 1 0.0154489 0.12829 -262.38
## - EP_MUNIT 1 0.0227824 0.13563 -259.71
## - EP_NOHSDP 1 0.0230720 0.13592 -259.61
##
## Step: AIC=-267.81
## y_train ~ EP_POV150 + EP_LIMENG + EP_AGE65 + EP_MINRTY + EP_UNEMP +
## EP_NOHSDP + EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_NOVEH
##
## Df Sum of Sq RSS AIC
## - EP_MINRTY 1 0.0018962 0.11646 -269.03
## - EP_NOVEH 1 0.0033541 0.11792 -268.43
## - EP_AGE65 1 0.0044758 0.11904 -267.98
## <none> 0.11457 -267.81
## + confirmed_cases 1 0.0017218 0.11284 -266.54
## + EP_AGE17 1 0.0015160 0.11305 -266.45
## + EP_GROUPQ 1 0.0007971 0.11377 -266.15
## - EP_SNGPNT 1 0.0091792 0.12374 -266.12
## + EP_DISABL 1 0.0006863 0.11388 -266.10
## - EP_UNEMP 1 0.0093913 0.12396 -266.03
## + EP_CROWD 1 0.0002027 0.11436 -265.90
## - EP_MOBILE 1 0.0111971 0.12576 -265.34
## - EP_LIMENG 1 0.0117731 0.12634 -265.12
## - EP_POV150 1 0.0146330 0.12920 -264.05
## - EP_MUNIT 1 0.0214240 0.13599 -261.59
## - EP_NOHSDP 1 0.0225704 0.13714 -261.18
##
## Step: AIC=-269.03
## y_train ~ EP_POV150 + EP_LIMENG + EP_AGE65 + EP_UNEMP + EP_NOHSDP +
## EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_NOVEH
##
## Df Sum of Sq RSS AIC
## - EP_AGE65 1 0.002585 0.11905 -269.97
## <none> 0.11646 -269.03
## - EP_NOVEH 1 0.005665 0.12213 -268.75
## + EP_MINRTY 1 0.001896 0.11457 -267.81
## + confirmed_cases 1 0.001792 0.11467 -267.77
## + EP_AGE17 1 0.001676 0.11479 -267.72
## + EP_GROUPQ 1 0.001209 0.11525 -267.53
## + EP_DISABL 1 0.000741 0.11572 -267.33
## - EP_MOBILE 1 0.009960 0.12642 -267.09
## - EP_LIMENG 1 0.009978 0.12644 -267.08
## + EP_CROWD 1 0.000000 0.11646 -267.03
## - EP_SNGPNT 1 0.010319 0.12678 -266.95
## - EP_POV150 1 0.012938 0.12940 -265.97
## - EP_UNEMP 1 0.020313 0.13678 -263.31
## - EP_NOHSDP 1 0.026480 0.14294 -261.19
## - EP_MUNIT 1 0.038744 0.15521 -257.24
##
## Step: AIC=-269.97
## y_train ~ EP_POV150 + EP_LIMENG + EP_UNEMP + EP_NOHSDP + EP_SNGPNT +
## EP_MUNIT + EP_MOBILE + EP_NOVEH
##
## Df Sum of Sq RSS AIC
## <none> 0.11905 -269.97
## - EP_NOVEH 1 0.005262 0.12431 -269.90
## + EP_AGE65 1 0.002585 0.11646 -269.03
## + EP_DISABL 1 0.002296 0.11675 -268.91
## - EP_SNGPNT 1 0.007954 0.12700 -268.87
## + EP_GROUPQ 1 0.001131 0.11792 -268.43
## + confirmed_cases 1 0.000876 0.11817 -268.33
## + EP_AGE17 1 0.000069 0.11898 -268.00
## + EP_CROWD 1 0.000038 0.11901 -267.99
## + EP_MINRTY 1 0.000005 0.11904 -267.98
## - EP_MOBILE 1 0.010645 0.12969 -267.86
## - EP_POV150 1 0.011651 0.13070 -267.49
## - EP_LIMENG 1 0.012139 0.13119 -267.31
## - EP_NOHSDP 1 0.024100 0.14315 -263.12
## - EP_UNEMP 1 0.034478 0.15352 -259.76
## - EP_MUNIT 1 0.036164 0.15521 -259.24
Model after feature selection
# Display the selected model
summary(lm_model2)
##
## Call:
## lm(formula = y_train ~ EP_POV150 + EP_LIMENG + EP_UNEMP + EP_NOHSDP +
## EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_NOVEH, data = cbind(y_train,
## X_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.105153 -0.028649 -0.001766 0.031433 0.102611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.096140 0.043226 2.224 0.03200 *
## EP_POV150 0.005710 0.002923 1.954 0.05793 .
## EP_LIMENG -0.011330 0.005681 -1.994 0.05316 .
## EP_UNEMP 0.020010 0.005954 3.361 0.00175 **
## EP_NOHSDP 0.010552 0.003755 2.810 0.00771 **
## EP_SNGPNT 0.011662 0.007224 1.614 0.11453
## EP_MUNIT 0.007165 0.002082 3.442 0.00139 **
## EP_MOBILE 0.004914 0.002631 1.867 0.06937 .
## EP_NOVEH -0.003763 0.002866 -1.313 0.19688
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05525 on 39 degrees of freedom
## Multiple R-squared: 0.8495, Adjusted R-squared: 0.8186
## F-statistic: 27.52 on 8 and 39 DF, p-value: 9.513e-14
# Get the selected features
selected_features <- names(coef(lm_model2)[-1])
selected_features
## [1] "EP_POV150" "EP_LIMENG" "EP_UNEMP" "EP_NOHSDP" "EP_SNGPNT" "EP_MUNIT"
## [7] "EP_MOBILE" "EP_NOVEH"
Observation:
The refined model, lm_model2, retains a robust explanatory capacity with a Multiple R-squared of 0.8495, nearly matching the initial model’s 0.8600. This is particularly noteworthy as it was achieved with a reduced set of predictors, indicating that the retained variables are indeed pivotal in explaining the variance in the social vulnerability index.
The Adjusted R-squared value of our refined model saw a marginal increase from the original model, suggesting that the model’s efficiency has been enhanced when adjusting for the number of predictors. This underscores the effectiveness of the feature selection process in isolating the most influential variables while eschewing redundant or less informative ones.
Statistical analysis of lm_model2 reveals that variables such as unemployment rates, lack of high school diploma attainment, and multi-unit housing remain statistically significant, reinforcing their roles as critical indicators of social vulnerability in the context of the pandemic.
A decrease in the residual standard error in lm_model2 compared to the initial model suggests a tighter fit to the data, enhancing the predictive accuracy of the model.
The refined model presents an optimal balance between model complexity and explanatory power, facilitating a nuanced understanding of the factors contributing to social vulnerability during the COVID-19 pandemic. This balance is crucial in the scientific pursuit of parsimony, ensuring that the model is not overly complex while still capturing the essential dynamics at play.
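The train/test split created earlier has not yet been used to score either model. `caret::postResample()` reports held-out RMSE, R-squared, and MAE from predictions against `y_test`; a sketch comparing the full and stepwise models:

```r
library(caret)

# Held-out performance for the full model and the reduced (stepwise) model.
pred_full    <- predict(lm_model1, newdata = X_test)
pred_reduced <- predict(lm_model2, newdata = X_test)
postResample(pred = pred_full, obs = y_test)
postResample(pred = pred_reduced, obs = y_test)
```

If the reduced model matches or beats the full model on the test set, that corroborates the feature selection; a large gap between training and test R-squared would instead suggest overfitting.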
Initial MLR for deaths
library(caret)
# Preparing the independent variable (X) and the dependent variable (y)
y <- svi_all_counties_deaths$RPL_THEMES
X <- svi_all_counties_deaths[, c('deaths','EP_POV150','EP_LIMENG','EP_AGE65', 'EP_MINRTY', 'EP_UNEMP', 'EP_NOHSDP','EP_AGE17','EP_DISABL','EP_SNGPNT','EP_MUNIT','EP_MOBILE','EP_CROWD','EP_NOVEH','EP_GROUPQ')]
# Splitting the data into train and test data
set.seed(42) # For reproducibility
train_indices <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices]
y_test <- y[-train_indices]
# Initiate the model
lm_model3 <- lm(y_train ~ ., data = cbind(y_train, X_train))
# Predictions on the test set
y_predict <- predict(lm_model3, newdata = X_test)
summary(lm_model3)
##
## Call:
## lm(formula = y_train ~ ., data = cbind(y_train, X_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.133460 -0.033317 0.001034 0.037791 0.105024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.124e-01 2.181e-01 -0.515 0.6099
## deaths -1.330e-07 3.570e-07 -0.373 0.7119
## EP_POV150 7.487e-03 3.713e-03 2.016 0.0522 .
## EP_LIMENG -2.473e-03 8.619e-03 -0.287 0.7760
## EP_AGE65 5.114e-03 5.406e-03 0.946 0.3512
## EP_MINRTY 1.191e-03 1.925e-03 0.619 0.5405
## EP_UNEMP 4.710e-03 9.641e-03 0.489 0.6285
## EP_NOHSDP 8.146e-03 6.664e-03 1.222 0.2305
## EP_AGE17 2.942e-03 9.066e-03 0.324 0.7477
## EP_DISABL 4.634e-03 7.012e-03 0.661 0.5134
## EP_SNGPNT 1.514e-02 9.044e-03 1.674 0.1038
## EP_MUNIT 6.876e-03 3.447e-03 1.995 0.0546 .
## EP_MOBILE 5.193e-03 3.351e-03 1.550 0.1310
## EP_CROWD -6.547e-03 8.107e-03 -0.808 0.4253
## EP_NOVEH -1.696e-03 1.085e-02 -0.156 0.8767
## EP_GROUPQ -2.268e-03 4.252e-03 -0.533 0.5974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06291 on 32 degrees of freedom
## Multiple R-squared: 0.8377, Adjusted R-squared: 0.7616
## F-statistic: 11.01 on 15 and 32 DF, p-value: 1.087e-08
# Score (R-squared)
print(summary(lm_model3)$r.squared)
## [1] 0.837654
# Coefficients
deaths_all_counties_coef <- data.frame(Coef = coef(lm_model3)[-1], Features = names(coef(lm_model3)[-1]))
deaths_all_counties_coef <- deaths_all_counties_coef[order(-deaths_all_counties_coef$Coef), ]
The model’s summary reveals that while the Multiple R-squared value is 0.8377, indicating that approximately 83.77% of the variance in the SVI score is captured by the model, the individual predictors’ contributions vary in statistical significance. Notably, EP_POV150 (poverty) shows a p-value close to the traditional threshold of statistical significance, suggesting a meaningful association with the SVI score. EP_MUNIT (multi-unit housing) also approaches significance, potentially reflecting the impact of housing conditions on social vulnerability.
However, the variable deaths does not exhibit a significant association with the SVI score within the context of this model, suggesting that the number of deaths alone is not a direct predictor of social vulnerability as encapsulated by the RPL_THEMES score when considering the influence of the other variables.
The model yields a residual standard error of 0.06291, and despite a substantial F-statistic, caution is needed when interpreting the non-significant variables. The Adjusted R-squared of 0.7616 reflects the model’s explanatory power after penalizing for the large number of predictors, indicating a good fit to the data.
Feature Selection for deaths MLR
# Stepwise feature selection
lm_model4 <- step(lm_model3, direction = "both")
## Start: AIC=-253
## y_train ~ deaths + EP_POV150 + EP_LIMENG + EP_AGE65 + EP_MINRTY +
## EP_UNEMP + EP_NOHSDP + EP_AGE17 + EP_DISABL + EP_SNGPNT +
## EP_MUNIT + EP_MOBILE + EP_CROWD + EP_NOVEH + EP_GROUPQ
##
## Df Sum of Sq RSS AIC
## - EP_NOVEH 1 0.0000968 0.12675 -254.96
## - EP_LIMENG 1 0.0003258 0.12698 -254.88
## - EP_AGE17 1 0.0004167 0.12707 -254.84
## - deaths 1 0.0005495 0.12720 -254.79
## - EP_UNEMP 1 0.0009447 0.12760 -254.64
## - EP_GROUPQ 1 0.0011261 0.12778 -254.57
## - EP_MINRTY 1 0.0015151 0.12817 -254.43
## - EP_DISABL 1 0.0017286 0.12838 -254.35
## - EP_CROWD 1 0.0025812 0.12923 -254.03
## - EP_AGE65 1 0.0035418 0.13020 -253.68
## <none> 0.12665 -253.00
## - EP_NOHSDP 1 0.0059131 0.13257 -252.81
## - EP_MOBILE 1 0.0095060 0.13616 -251.53
## - EP_SNGPNT 1 0.0110964 0.13775 -250.97
## - EP_MUNIT 1 0.0157516 0.14240 -249.37
## - EP_POV150 1 0.0160879 0.14274 -249.26
##
## Step: AIC=-254.96
## y_train ~ deaths + EP_POV150 + EP_LIMENG + EP_AGE65 + EP_MINRTY +
## EP_UNEMP + EP_NOHSDP + EP_AGE17 + EP_DISABL + EP_SNGPNT +
## EP_MUNIT + EP_MOBILE + EP_CROWD + EP_GROUPQ
##
## Df Sum of Sq RSS AIC
## - EP_LIMENG 1 0.0004570 0.12721 -256.79
## - EP_AGE17 1 0.0005486 0.12730 -256.76
## - deaths 1 0.0005919 0.12734 -256.74
## - EP_UNEMP 1 0.0008853 0.12763 -256.63
## - EP_GROUPQ 1 0.0011276 0.12788 -256.54
## - EP_MINRTY 1 0.0016409 0.12839 -256.35
## - EP_DISABL 1 0.0017022 0.12845 -256.32
## - EP_CROWD 1 0.0027816 0.12953 -255.92
## <none> 0.12675 -254.96
## - EP_AGE65 1 0.0057446 0.13249 -254.84
## - EP_NOHSDP 1 0.0064980 0.13325 -254.56
## - EP_MOBILE 1 0.0094523 0.13620 -253.51
## + EP_NOVEH 1 0.0000968 0.12665 -253.00
## - EP_SNGPNT 1 0.0110474 0.13780 -252.95
## - EP_MUNIT 1 0.0164093 0.14316 -251.12
## - EP_POV150 1 0.0165064 0.14326 -251.09
##
## Step: AIC=-256.79
## y_train ~ deaths + EP_POV150 + EP_AGE65 + EP_MINRTY + EP_UNEMP +
## EP_NOHSDP + EP_AGE17 + EP_DISABL + EP_SNGPNT + EP_MUNIT +
## EP_MOBILE + EP_CROWD + EP_GROUPQ
##
## Df Sum of Sq RSS AIC
## - deaths 1 0.0004491 0.12766 -258.62
## - EP_UNEMP 1 0.0006280 0.12784 -258.56
## - EP_GROUPQ 1 0.0008911 0.12810 -258.46
## - EP_MINRTY 1 0.0012334 0.12844 -258.33
## - EP_AGE17 1 0.0013211 0.12853 -258.30
## - EP_DISABL 1 0.0026955 0.12990 -257.78
## - EP_CROWD 1 0.0030686 0.13028 -257.65
## <none> 0.12721 -256.79
## - EP_AGE65 1 0.0056972 0.13290 -256.69
## - EP_NOHSDP 1 0.0089397 0.13615 -255.53
## - EP_MOBILE 1 0.0104275 0.13763 -255.01
## + EP_LIMENG 1 0.0004570 0.12675 -254.96
## + EP_NOVEH 1 0.0002279 0.12698 -254.88
## - EP_SNGPNT 1 0.0118908 0.13910 -254.50
## - EP_MUNIT 1 0.0161630 0.14337 -253.05
## - EP_POV150 1 0.0167628 0.14397 -252.85
##
## Step: AIC=-258.62
## y_train ~ EP_POV150 + EP_AGE65 + EP_MINRTY + EP_UNEMP + EP_NOHSDP +
## EP_AGE17 + EP_DISABL + EP_SNGPNT + EP_MUNIT + EP_MOBILE +
## EP_CROWD + EP_GROUPQ
##
## Df Sum of Sq RSS AIC
## - EP_UNEMP 1 0.0006583 0.12831 -260.38
## - EP_GROUPQ 1 0.0006676 0.12832 -260.37
## - EP_MINRTY 1 0.0013009 0.12896 -260.13
## - EP_DISABL 1 0.0024034 0.13006 -259.73
## - EP_AGE17 1 0.0027035 0.13036 -259.62
## - EP_CROWD 1 0.0034957 0.13115 -259.32
## <none> 0.12766 -258.62
## - EP_AGE65 1 0.0061289 0.13378 -258.37
## - EP_NOHSDP 1 0.0087543 0.13641 -257.44
## - EP_MOBILE 1 0.0104576 0.13811 -256.84
## + deaths 1 0.0004491 0.12721 -256.79
## + EP_LIMENG 1 0.0003142 0.12734 -256.74
## + EP_NOVEH 1 0.0002525 0.12740 -256.72
## - EP_SNGPNT 1 0.0115922 0.13925 -256.45
## - EP_POV150 1 0.0164841 0.14414 -254.79
## - EP_MUNIT 1 0.0175708 0.14523 -254.43
##
## Step: AIC=-260.37
## y_train ~ EP_POV150 + EP_AGE65 + EP_MINRTY + EP_NOHSDP + EP_AGE17 +
## EP_DISABL + EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_CROWD +
## EP_GROUPQ
##
## Df Sum of Sq RSS AIC
## - EP_GROUPQ 1 0.000375 0.12869 -262.24
## - EP_DISABL 1 0.002946 0.13126 -261.29
## - EP_MINRTY 1 0.003094 0.13141 -261.23
## - EP_CROWD 1 0.003099 0.13141 -261.23
## - EP_AGE17 1 0.003510 0.13182 -261.08
## <none> 0.12831 -260.38
## - EP_NOHSDP 1 0.008406 0.13672 -259.33
## - EP_MOBILE 1 0.009993 0.13831 -258.77
## + EP_UNEMP 1 0.000658 0.12766 -258.62
## + deaths 1 0.000479 0.12784 -258.56
## + EP_NOVEH 1 0.000115 0.12820 -258.42
## + EP_LIMENG 1 0.000109 0.12821 -258.42
## - EP_AGE65 1 0.011418 0.13973 -258.28
## - EP_SNGPNT 1 0.012217 0.14053 -258.01
## - EP_MUNIT 1 0.016944 0.14526 -256.42
## - EP_POV150 1 0.034705 0.16302 -250.88
##
## Step: AIC=-262.23
## y_train ~ EP_POV150 + EP_AGE65 + EP_MINRTY + EP_NOHSDP + EP_AGE17 +
## EP_DISABL + EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_CROWD
##
## Df Sum of Sq RSS AIC
## - EP_DISABL 1 0.002736 0.13142 -263.23
## - EP_CROWD 1 0.002832 0.13152 -263.19
## - EP_MINRTY 1 0.003188 0.13188 -263.06
## - EP_AGE17 1 0.004447 0.13314 -262.60
## <none> 0.12869 -262.24
## - EP_NOHSDP 1 0.008032 0.13672 -261.33
## - EP_MOBILE 1 0.009635 0.13832 -260.77
## + EP_GROUPQ 1 0.000375 0.12831 -260.38
## + EP_UNEMP 1 0.000366 0.12832 -260.37
## + deaths 1 0.000280 0.12841 -260.34
## + EP_NOVEH 1 0.000112 0.12858 -260.28
## + EP_LIMENG 1 0.000072 0.12862 -260.26
## - EP_SNGPNT 1 0.012046 0.14073 -259.94
## - EP_AGE65 1 0.012750 0.14144 -259.70
## - EP_MUNIT 1 0.016642 0.14533 -258.40
## - EP_POV150 1 0.034669 0.16336 -252.78
##
## Step: AIC=-263.22
## y_train ~ EP_POV150 + EP_AGE65 + EP_MINRTY + EP_NOHSDP + EP_AGE17 +
## EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_CROWD
##
## Df Sum of Sq RSS AIC
## - EP_MINRTY 1 0.004196 0.13562 -263.72
## - EP_AGE17 1 0.004452 0.13588 -263.63
## - EP_CROWD 1 0.004891 0.13632 -263.47
## <none> 0.13142 -263.23
## - EP_NOHSDP 1 0.008242 0.13967 -262.31
## + EP_DISABL 1 0.002736 0.12869 -262.24
## + EP_UNEMP 1 0.000853 0.13057 -261.54
## + EP_LIMENG 1 0.000655 0.13077 -261.46
## + EP_GROUPQ 1 0.000165 0.13126 -261.29
## + EP_NOVEH 1 0.000115 0.13131 -261.27
## + deaths 1 0.000088 0.13134 -261.26
## - EP_MOBILE 1 0.012159 0.14358 -260.98
## - EP_SNGPNT 1 0.012482 0.14391 -260.87
## - EP_MUNIT 1 0.014936 0.14636 -260.06
## - EP_AGE65 1 0.029925 0.16135 -255.38
## - EP_POV150 1 0.050666 0.18209 -249.57
##
## Step: AIC=-263.72
## y_train ~ EP_POV150 + EP_AGE65 + EP_NOHSDP + EP_AGE17 + EP_SNGPNT +
## EP_MUNIT + EP_MOBILE + EP_CROWD
##
## Df Sum of Sq RSS AIC
## - EP_CROWD 1 0.002397 0.13802 -264.88
## <none> 0.13562 -263.72
## + EP_MINRTY 1 0.004196 0.13142 -263.23
## + EP_DISABL 1 0.003745 0.13188 -263.06
## + EP_UNEMP 1 0.003221 0.13240 -262.87
## - EP_AGE17 1 0.010298 0.14592 -262.20
## - EP_MOBILE 1 0.011287 0.14691 -261.88
## + EP_NOVEH 1 0.000344 0.13528 -261.84
## + EP_GROUPQ 1 0.000206 0.13542 -261.79
## + deaths 1 0.000134 0.13549 -261.76
## + EP_LIMENG 1 0.000042 0.13558 -261.73
## - EP_NOHSDP 1 0.012108 0.14773 -261.61
## - EP_SNGPNT 1 0.015806 0.15143 -260.43
## - EP_AGE65 1 0.027283 0.16290 -256.92
## - EP_MUNIT 1 0.027540 0.16316 -256.84
## - EP_POV150 1 0.046767 0.18239 -251.50
##
## Step: AIC=-264.88
## y_train ~ EP_POV150 + EP_AGE65 + EP_NOHSDP + EP_AGE17 + EP_SNGPNT +
## EP_MUNIT + EP_MOBILE
##
## Df Sum of Sq RSS AIC
## <none> 0.13802 -264.88
## + EP_DISABL 1 0.005031 0.13299 -264.66
## + EP_CROWD 1 0.002397 0.13562 -263.72
## - EP_NOHSDP 1 0.009711 0.14773 -263.61
## + EP_UNEMP 1 0.001837 0.13618 -263.52
## + EP_MINRTY 1 0.001702 0.13632 -263.47
## + deaths 1 0.000344 0.13767 -263.00
## + EP_LIMENG 1 0.000163 0.13786 -262.93
## + EP_NOVEH 1 0.000124 0.13789 -262.92
## + EP_GROUPQ 1 0.000009 0.13801 -262.88
## - EP_AGE17 1 0.012323 0.15034 -262.77
## - EP_SNGPNT 1 0.014690 0.15271 -262.02
## - EP_MOBILE 1 0.016171 0.15419 -261.56
## - EP_MUNIT 1 0.026213 0.16423 -258.53
## - EP_AGE65 1 0.033740 0.17176 -256.38
## - EP_POV150 1 0.044386 0.18240 -253.49
Model after feature selection
# Display the selected model
summary(lm_model4)
##
## Call:
## lm(formula = y_train ~ EP_POV150 + EP_AGE65 + EP_NOHSDP + EP_AGE17 +
## EP_SNGPNT + EP_MUNIT + EP_MOBILE, data = cbind(y_train, X_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.126798 -0.027921 0.001379 0.025232 0.109946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.241102 0.144877 -1.664 0.103892
## EP_POV150 0.008180 0.002281 3.587 0.000902 ***
## EP_AGE65 0.008853 0.002831 3.127 0.003286 **
## EP_NOHSDP 0.004919 0.002932 1.678 0.101216
## EP_AGE17 0.010534 0.005574 1.890 0.066049 .
## EP_SNGPNT 0.016504 0.007998 2.063 0.045608 *
## EP_MUNIT 0.005614 0.002037 2.756 0.008763 **
## EP_MOBILE 0.006032 0.002786 2.165 0.036422 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05874 on 40 degrees of freedom
## Multiple R-squared: 0.8231, Adjusted R-squared: 0.7921
## F-statistic: 26.59 on 7 and 40 DF, p-value: 3.772e-13
# Get the selected features
selected_features <- names(coef(lm_model4)[-1])
selected_features
## [1] "EP_POV150" "EP_AGE65" "EP_NOHSDP" "EP_AGE17" "EP_SNGPNT" "EP_MUNIT"
## [7] "EP_MOBILE"
The lm_model4 model after feature selection includes the predictors “EP_POV150,” “EP_AGE65,” “EP_NOHSDP,” “EP_AGE17,” “EP_SNGPNT,” “EP_MUNIT,” and “EP_MOBILE,” which have been identified as the most influential on the Social Vulnerability Index (SVI) score in the context of COVID-19 related deaths.
The lm_model4 (model after feature selection) displays a Multiple R-squared value of 0.8231. This suggests that the model explains over 82% of the variance in the SVI score, a slight decrement from the 0.8377 observed in the full model lm_model3. However, the Adjusted R-squared value has improved to 0.7921, reinforcing the model’s robustness when adjusted for the number of predictors included.
Each variable within the lm_model4 has been evaluated for its statistical significance. Variables such as “EP_POV150” and “EP_AGE65” have shown strong significance, implying a robust relationship with the SVI score.
It is noteworthy that the residual standard error of lm_model4 (0.05874) is slightly lower than that of the full model (0.06291), reinforcing the accuracy of this more focused model. The substantial F-statistic corroborates the overall statistical validity of lm_model4.
In summation, the model after feature selection represents an optimized model that balances complexity with interpretability, achieving a notable degree of explanatory power with a reduced set of variables.
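A reduced model like lm_model4 is typically obtained by stepwise elimination. Below is a minimal sketch of AIC-based backward selection with step() on synthetic data; the variable names are illustrative, not the report's EP_* columns, and this assumes backward selection was the method used.

```r
# Backward elimination by AIC on a small synthetic dataset
set.seed(1)
n <- 60
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 0.8 * df$x1 + 0.5 * df$x2 + rnorm(n, sd = 0.3)  # x3 is pure noise

full_model    <- lm(y ~ x1 + x2 + x3, data = df)
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Analogous to selected_features in the report
names(coef(reduced_model))[-1]
```

Applied to the full model lm_model3, the same pattern would return the retained EP_* predictors listed above.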
To understand the determinants of the Social Vulnerability Index (SVI) more comprehensively, we merged three datasets (the SVI dataset, the Economy dataset, and the COVID dataset), linked through the shared keys 'county_fips' and 'GEO_ID.' This integration combines socio-economic indicators, median household income, and COVID-19 metrics at the county level.
SVI Dataset:
The SVI dataset contributes a broad set of socio-economic indicators (the EP_* variables) that capture vulnerability dimensions such as education, income, housing, and health. These variables serve as the core predictors for our modeling.
Economy Dataset:
The Economy dataset supplies economic indicators, chiefly median household income (Census table B19013). Income intersects with the broader socio-economic fabric and strongly influences vulnerability levels across communities.
COVID Dataset:
The COVID dataset adds health-related variables, 'total_cases' and 'total_deaths.' Including these COVID-19 metrics acknowledges the disruptions posed by the pandemic and gives our predictive modeling a contemporary lens.
Our objective is to model and predict the total SVI (RPL_THEMES) from SVI-related variables, median income, and COVID-19 metrics. By integrating these dimensions, we aim to uncover the relationships that contribute to social vulnerability, accounting for both pre-existing disparities and the impact of the pandemic.
This predictive modeling initiative holds significant implications for strategic interventions, policy formulation, and community resilience efforts. By unraveling the intricate interplay of socio-economic and health factors, our models aim to provide actionable insights that can inform targeted interventions and support the development of resilient communities.
In the subsequent sections, we delve into the application of Random Forest and Decision Tree models to further dissect these relationships, offering interpretability and predictive accuracy to our intricate dataset.
Before merging the three datasets, we first clean each dataset and rename columns where needed.
SVI_Data <- read.csv("SVI_2020_US.csv")
head(SVI_Data)
## ST STATE ST_ABBR STCNTY COUNTY FIPS
## 1 1 Alabama AL 1001 Autauga 1001020100
## 2 1 Alabama AL 1001 Autauga 1001020200
## 3 1 Alabama AL 1001 Autauga 1001020300
## 4 1 Alabama AL 1001 Autauga 1001020400
## 5 1 Alabama AL 1001 Autauga 1001020501
## 6 1 Alabama AL 1001 Autauga 1001020502
## LOCATION AREA_SQMI E_TOTPOP M_TOTPOP E_HU
## 1 Census Tract 201, Autauga County, Alabama 3.7935695 1941 390 710
## 2 Census Tract 202, Autauga County, Alabama 1.2821745 1757 310 720
## 3 Census Tract 203, Autauga County, Alabama 2.0653642 3694 570 1464
## 4 Census Tract 204, Autauga County, Alabama 2.4649840 3539 500 1741
## 5 Census Tract 205.01, Autauga County, Alabama 2.3952432 4306 662 1786
## 6 Census Tract 205.02, Autauga County, Alabama 0.8098065 3100 531 1285
## M_HU E_HH M_HH E_POV150 M_POV150 E_UNEMP M_UNEMP E_HBURD M_HBURD E_NOHSDP
## 1 120 693 121 352 138 18 18 144 59 187
## 2 99 573 99 384 182 29 26 149 60 139
## 3 180 1351 173 842 317 53 45 264 96 317
## 4 200 1636 213 503 213 39 34 284 100 173
## 5 220 1706 244 903 496 23 31 351 176 251
## 6 212 1285 212 404 305 31 50 236 146 40
## M_NOHSDP E_UNINSUR M_UNINSUR E_AGE65 M_AGE65 E_AGE17 M_AGE17 E_DISABL
## 1 93 187 91 295 101 415 208 413
## 2 59 91 55 284 97 325 110 168
## 3 122 127 85 464 68 929 253 515
## 4 108 169 126 969 270 510 155 690
## 5 171 135 138 541 180 1136 320 594
## 6 60 0 12 401 178 1037 313 280
## M_DISABL E_SNGPNT M_SNGPNT E_LIMENG M_LIMENG E_MINRTY M_MINRTY E_MUNIT
## 1 147 51 31 0 48 437 192 0
## 2 73 21 25 0 48 1116 306 3
## 3 148 152 88 128 133 1331 548 26
## 4 255 85 65 89 130 454 266 143
## 5 204 88 66 52 84 789 358 57
## 6 146 155 129 0 48 943 406 220
## M_MUNIT E_MOBILE M_MOBILE E_CROWD M_CROWD E_NOVEH M_NOVEH E_GROUPQ M_GROUPQ
## 1 16 88 43 0 16 10 12 0 12
## 2 13 5 8 9 16 57 37 212 85
## 3 43 14 18 35 37 42 37 0 12
## 4 115 0 12 10 16 72 93 0 12
## 5 87 29 48 0 16 44 50 11 7
## 6 142 0 12 37 64 0 12 0 12
## EP_POV150 MP_POV150 EP_UNEMP MP_UNEMP EP_HBURD MP_HBURD EP_NOHSDP MP_NOHSDP
## 1 18.1 6.1 2.1 2.1 20.8 7.7 14.3 6.3
## 2 25.4 11.0 4.0 3.5 26.0 9.5 10.6 4.8
## 3 22.8 7.8 2.7 2.3 19.5 6.7 12.8 5.3
## 4 14.2 5.7 2.4 2.0 17.4 5.7 6.2 3.6
## 5 21.0 11.1 1.0 1.3 20.6 9.9 8.8 6.0
## 6 13.0 9.6 2.7 4.2 18.4 10.9 2.5 3.5
## EP_UNINSUR MP_UNINSUR EP_AGE65 MP_AGE65 EP_AGE17 MP_AGE17 EP_DISABL MP_DISABL
## 1 9.6 5.1 15.2 5.1 21.4 9.8 21.3 7.4
## 2 5.9 3.7 16.2 4.6 18.5 5.3 11.0 3.8
## 3 3.5 2.4 12.6 2.5 25.1 5.6 14.0 3.5
## 4 4.8 3.3 27.4 6.5 14.4 3.9 19.6 6.2
## 5 3.2 3.3 12.6 4.6 26.4 6.2 14.0 4.9
## 6 0.0 1.2 12.9 5.5 33.5 8.3 10.0 5.5
## EP_SNGPNT MP_SNGPNT EP_LIMENG MP_LIMENG EP_MINRTY MP_MINRTY EP_MUNIT MP_MUNIT
## 1 7.4 4.3 0.0 2.6 22.5 8.8 0.0 2.3
## 2 3.7 4.3 0.0 2.9 63.5 13.3 0.4 1.8
## 3 11.3 6.4 3.6 3.7 36.0 13.8 1.8 2.9
## 4 5.2 3.9 2.6 3.8 12.8 7.3 8.2 6.5
## 5 5.2 3.8 1.3 2.1 18.3 7.8 3.2 4.9
## 6 12.1 9.8 0.0 1.7 30.4 12.0 17.1 10.7
## EP_MOBILE MP_MOBILE EP_CROWD MP_CROWD EP_NOVEH MP_NOVEH EP_GROUPQ MP_GROUPQ
## 1 12.4 5.8 0.0 2.3 1.4 1.8 0.0 0.6
## 2 0.7 1.1 1.6 2.8 9.9 6.3 12.1 4.3
## 3 1.0 1.2 2.6 2.7 3.1 2.8 0.0 0.3
## 4 0.0 2.0 0.6 1.0 4.4 5.4 0.0 0.3
## 5 1.6 2.7 0.0 0.9 2.6 3.0 0.3 0.2
## 6 0.0 2.7 2.9 5.0 0.0 2.7 0.0 0.4
## EPL_POV150 EPL_UNEMP EPL_HBURD EPL_NOHSDP EPL_UNINSUR SPL_THEME1 RPL_THEME1
## 1 0.4727 0.1731 0.3448 0.6963 0.6529 2.3398 0.4578
## 2 0.6491 0.4210 0.5214 0.5673 0.4320 2.5908 0.5348
## 3 0.5917 0.2487 0.2965 0.6504 0.2399 2.0272 0.3639
## 4 0.3598 0.2096 0.2214 0.3440 0.3486 1.4834 0.2081
## 5 0.5495 0.0641 0.3375 0.4860 0.2142 1.6513 0.2540
## 6 0.3250 0.2487 0.2575 0.1165 0.0000 0.9477 0.0828
## EPL_AGE65 EPL_AGE17 EPL_DISABL EPL_SNGPNT EPL_LIMENG SPL_THEME2 RPL_THEME2
## 1 0.4693 0.4653 0.8926 0.6627 0.0000 2.4899 0.5079
## 2 0.5252 0.2898 0.3984 0.3493 0.0000 1.5627 0.0810
## 3 0.3246 0.6915 0.5985 0.8384 0.7095 3.1625 0.8632
## 4 0.9146 0.1303 0.8487 0.4947 0.6462 3.0345 0.8131
## 5 0.3246 0.7589 0.5985 0.4947 0.5112 2.6879 0.6297
## 6 0.3415 0.9511 0.3274 0.8614 0.0000 2.4814 0.5020
## EPL_MINRTY SPL_THEME3 RPL_THEME3 EPL_MUNIT EPL_MOBILE EPL_CROWD EPL_NOVEH
## 1 0.3921 0.3921 0.3921 0.0000 0.8180 0.0000 0.1872
## 2 0.7610 0.7610 0.7610 0.2798 0.5115 0.4821 0.7387
## 3 0.5529 0.5529 0.5529 0.3930 0.5418 0.6078 0.3651
## 4 0.2386 0.2386 0.2386 0.6117 0.0000 0.3026 0.4731
## 5 0.3313 0.3313 0.3313 0.4630 0.5860 0.0000 0.3173
## 6 0.4923 0.4923 0.4923 0.7559 0.0000 0.6378 0.0000
## EPL_GROUPQ SPL_THEME4 RPL_THEME4 SPL_THEMES RPL_THEMES F_POV150 F_UNEMP
## 1 0.0000 1.0052 0.0945 6.2270 0.2823 0 0
## 2 0.9570 2.9691 0.7915 7.8836 0.5406 0 0
## 3 0.0000 1.9077 0.3500 7.6503 0.5042 0 0
## 4 0.0000 1.3874 0.1759 6.1439 0.2703 0 0
## 5 0.5375 1.9038 0.3484 6.5743 0.3343 0 0
## 6 0.0000 1.3937 0.1776 5.3151 0.1629 0 0
## F_HBURD F_NOHSDP F_UNINSUR F_THEME1 F_AGE65 F_AGE17 F_DISABL F_SNGPNT
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 1 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 1 0 0
## F_LIMENG F_THEME2 F_MINRTY F_THEME3 F_MUNIT F_MOBILE F_CROWD F_NOVEH F_GROUPQ
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 1
## 3 0 0 0 0 0 0 0 0 0
## 4 0 1 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 1 0 0 0 0 0 0 0
## F_THEME4 F_TOTAL E_DAYPOP E_NOINT M_NOINT E_AFAM M_AFAM E_HISP M_HISP E_ASIAN
## 1 0 0 1033 217 544 235 151 33 34 41
## 2 1 1 4080 301 372 1026 298 30 38 0
## 3 0 0 2056 401 795 1042 499 180 207 44
## 4 0 1 1908 335 712 309 242 17 27 17
## 5 0 0 3774 172 954 658 341 36 64 80
## 6 0 1 3437 376 808 232 94 369 217 278
## M_ASIAN E_AIAN M_AIAN E_NHPI M_NHPI E_TWOMORE M_TWOMORE E_OTHERRACE
## 1 54 0 12 0 12 128 99 0
## 2 12 0 12 0 12 46 54 14
## 3 42 0 12 0 12 65 85 0
## 4 21 10 17 0 12 101 103 0
## 5 85 0 12 0 12 15 24 0
## 6 212 0 12 0 12 32 221 32
## M_OTHERRACE EP_NOINT MP_NOINT EP_AFAM MP_AFAM EP_HISP MP_HISP EP_ASIAN
## 1 12 11.2 2.3 12.1 7.1 1.7 1.8 2.1
## 2 19 19.5 3.8 58.4 8.0 1.7 2.1 0.0
## 3 12 10.9 1.7 28.2 10.4 4.9 5.5 1.2
## 4 12 9.5 1.4 8.7 6.5 0.5 0.7 0.5
## 5 12 4.0 0.7 15.3 7.6 0.8 1.5 1.9
## 6 124 12.1 2.1 7.5 3.1 11.9 7.1 9.0
## MP_ASIAN EP_AIAN MP_AIAN EP_NHPI MP_NHPI EP_TWOMORE MP_TWOMORE EP_OTHERRACE
## 1 2.7 0.0 1.8 0 1.8 6.6 5.1 0.0
## 2 2.0 0.0 2.0 0 2.0 2.6 3.0 0.8
## 3 1.1 0.0 0.9 0 0.9 1.8 2.3 0.0
## 4 0.6 0.3 0.5 0 1.0 2.9 2.8 0.0
## 5 2.0 0.0 0.8 0 0.8 0.3 0.6 0.0
## 6 6.6 0.0 1.1 0 1.1 1.0 7.3 1.0
## MP_OTHERRACE
## 1 1.8
## 2 1.1
## 3 0.9
## 4 1.0
## 5 0.8
## 6 4.0
econ <- read.csv("ACSDT5Y2020.B19013-Data.csv")
head(econ)
## GEO_ID NAME
## 1 Geography Geographic Area Name
## 2 0400000US06 California
## 3 1400000US06001400100 Census Tract 4001, Alameda County, California
## 4 1400000US06001400200 Census Tract 4002, Alameda County, California
## 5 1400000US06001400300 Census Tract 4003, Alameda County, California
## 6 1400000US06001400400 Census Tract 4004, Alameda County, California
## B19013_001E
## 1 Estimate!!Median household income in the past 12 months (in 2020 inflation-adjusted dollars)
## 2 78672
## 3 220921
## 4 200192
## 5 118695
## 6 137067
## B19013_001M
## 1 Margin of Error!!Median household income in the past 12 months (in 2020 inflation-adjusted dollars)
## 2 270
## 3 25969
## 4 23361
## 5 23530
## 6 5162
## X
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
us_cases <- read.csv('USAFacts/confirmed-covid-19-cases-in-us-by-state-and-county.csv')
head(us_cases)
## X county_fips county_name state_name state_fips date
## 1 0 0 Statewide Unallocated AL 1 2020-01-22
## 2 1 0 Statewide Unallocated AL 1 2020-01-23
## 3 2 0 Statewide Unallocated AL 1 2020-01-24
## 4 3 0 Statewide Unallocated AL 1 2020-01-25
## 5 4 0 Statewide Unallocated AL 1 2020-01-26
## 6 5 0 Statewide Unallocated AL 1 2020-01-27
## confirmed lat long geometry
## 1 0 NA NA
## 2 0 NA NA
## 3 0 NA NA
## 4 0 NA NA
## 5 0 NA NA
## 6 0 NA NA
us_deaths <- read.csv('USAFacts/confirmed-covid-19-deaths-in-us-by-state-and-county.csv')
tail(us_deaths)
## X county_fips county_name state_name state_fips date deaths
## 600655 600654 56045 Weston County WY 56 2020-07-22 0
## 600656 600655 56045 Weston County WY 56 2020-07-23 0
## 600657 600656 56045 Weston County WY 56 2020-07-24 0
## 600658 600657 56045 Weston County WY 56 2020-07-25 0
## 600659 600658 56045 Weston County WY 56 2020-07-26 0
## 600660 600659 56045 Weston County WY 56 2020-07-27 0
## lat long geometry
## 600655 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
## 600656 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
## 600657 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
## 600658 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
## 600659 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
## 600660 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
Subsetting and renaming column names
Clean_data <- subset(SVI_Data, ST_ABBR == "CA")
CA_SVI <- subset(Clean_data, select = c(STATE,STCNTY,COUNTY,FIPS,EP_POV150,EP_LIMENG,EP_AGE65, EP_MINRTY, EP_UNEMP, EP_NOHSDP,EP_AGE17,EP_DISABL,EP_SNGPNT,EP_MUNIT,EP_MOBILE,EP_CROWD,EP_NOVEH,EP_GROUPQ,RPL_THEMES) )
# Renaming columns
colnames(CA_SVI)[colnames(CA_SVI) == 'STCNTY'] <- 'county_fips'
library(dplyr)  # rename() is a dplyr function
us_cases <- rename(us_cases, confirmed_cases = confirmed)
# Filter out rows with any value equal to -999
CA_SVI <- CA_SVI[rowSums(CA_SVI == -999, na.rm = TRUE) == 0, ]
# Drop rows with missing values
CA_SVI <- na.omit(CA_SVI)
covid19_cases<-us_cases
covid19_deaths<-us_deaths
covid19_cases<-subset(covid19_cases, state_name=='CA')
covid19_deaths<-subset(covid19_deaths, state_name=='CA')
head(covid19_cases)
## X county_fips county_name state_name state_fips date
## 35909 35908 0 Statewide Unallocated CA 6 2020-01-22
## 35910 35909 0 Statewide Unallocated CA 6 2020-01-23
## 35911 35910 0 Statewide Unallocated CA 6 2020-01-24
## 35912 35911 0 Statewide Unallocated CA 6 2020-01-25
## 35913 35912 0 Statewide Unallocated CA 6 2020-01-26
## 35914 35913 0 Statewide Unallocated CA 6 2020-01-27
## confirmed_cases lat long geometry
## 35909 0 NA NA
## 35910 0 NA NA
## 35911 0 NA NA
## 35912 0 NA NA
## 35913 0 NA NA
## 35914 0 NA NA
head(covid19_deaths)
## X county_fips county_name state_name state_fips date
## 35909 35908 0 Statewide Unallocated CA 6 2020-01-22
## 35910 35909 0 Statewide Unallocated CA 6 2020-01-23
## 35911 35910 0 Statewide Unallocated CA 6 2020-01-24
## 35912 35911 0 Statewide Unallocated CA 6 2020-01-25
## 35913 35912 0 Statewide Unallocated CA 6 2020-01-26
## 35914 35913 0 Statewide Unallocated CA 6 2020-01-27
## deaths lat long geometry
## 35909 0 NA NA
## 35910 0 NA NA
## 35911 0 NA NA
## 35912 0 NA NA
## 35913 0 NA NA
## 35914 0 NA NA
Merging COVID-19 cases and deaths
merged_data <- merge(covid19_cases, covid19_deaths, by = "county_fips")
# Load the dplyr package
library(dplyr)
# Calculate total cases and deaths for each county
covid19 <- merged_data %>%
group_by(county_fips) %>%
mutate(total_cases = sum(confirmed_cases),
total_deaths = sum(deaths)) %>%
ungroup()
head(covid19)
## # A tibble: 6 × 21
## county_fips X.x county_name.x state_name.x state_fips.x date.x
## <int> <int> <chr> <chr> <int> <chr>
## 1 0 35908 Statewide Unallocated CA 6 2020-01-22
## 2 0 35908 Statewide Unallocated CA 6 2020-01-22
## 3 0 35908 Statewide Unallocated CA 6 2020-01-22
## 4 0 35908 Statewide Unallocated CA 6 2020-01-22
## 5 0 35908 Statewide Unallocated CA 6 2020-01-22
## 6 0 35908 Statewide Unallocated CA 6 2020-01-22
## # ℹ 15 more variables: confirmed_cases <int>, lat.x <dbl>, long.x <dbl>,
## # geometry.x <chr>, X.y <int>, county_name.y <chr>, state_name.y <chr>,
## # state_fips.y <int>, date.y <chr>, deaths <int>, lat.y <dbl>, long.y <dbl>,
## # geometry.y <chr>, total_cases <int>, total_deaths <int>
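One caveat on the merge above: joining the two daily time series on county_fips alone pairs every case row with every death row for the same county, which multiplies the row count. If one row per county per day is intended, merging on both keys avoids the cross join; a base-R sketch on toy data:

```r
# Toy daily series for two counties (illustrative values)
cases  <- data.frame(county_fips = c(1, 1, 2, 2),
                     date = c("2020-01-22", "2020-01-23", "2020-01-22", "2020-01-23"),
                     confirmed_cases = c(0, 3, 1, 4))
deaths <- data.frame(county_fips = c(1, 1, 2, 2),
                     date = c("2020-01-22", "2020-01-23", "2020-01-22", "2020-01-23"),
                     deaths = c(0, 1, 0, 2))

# Keyed on county AND date: one row per county-day, no duplication
daily <- merge(cases, deaths, by = c("county_fips", "date"))
nrow(daily)  # 4 rows, versus 8 from a county-only merge
```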
Creating a new dataframe with total cases and deaths for each county
# Create a new dataframe with total cases and deaths for each county
summary_data <- covid19 %>%
group_by(county_fips) %>%
summarize(total_cases = sum(confirmed_cases),
total_deaths = sum(deaths))
# Display the new dataframe
print(summary_data)
## # A tibble: 60 × 3
## county_fips total_cases total_deaths
## <int> <int> <int>
## 1 0 6016 0
## 2 6000 564564 0
## 3 6001 87753324 2084168
## 4 6003 43428 0
## 5 6005 411344 0
## 6 6007 3079064 27260
## 7 6009 631680 1316
## 8 6011 876644 2444
## 9 6013 47864988 1044716
## 10 6015 706128 0
## # ℹ 50 more rows
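A related caveat: if confirmed_cases and deaths are cumulative running totals (as daily USAFacts series typically are), summing them over dates compounds each day's total into the next, which would explain the very large figures above. Taking the per-county maximum instead yields the count as of the latest date; a base-R sketch on toy data:

```r
# Cumulative toy series: county 1 ends at 5 cases, county 2 at 3
daily <- data.frame(county_fips = c(1, 1, 1, 2, 2, 2),
                    confirmed_cases = c(1, 3, 5, 0, 2, 3))
latest <- aggregate(confirmed_cases ~ county_fips, data = daily, FUN = max)
latest$confirmed_cases  # 5 3, rather than sum()'s 9 5
```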
Merging SVI data and COVID-19 data using county_fips
#merging svi and summary_data
svi_covid <- merge(CA_SVI, summary_data, by = "county_fips", all.x = TRUE)
head(svi_covid)
## county_fips STATE COUNTY FIPS EP_POV150 EP_LIMENG EP_AGE65
## 1 6001 California Alameda 6001400100 6.8 0.0 25.3
## 2 6001 California Alameda 6001400200 7.0 1.3 21.7
## 3 6001 California Alameda 6001400300 8.6 0.3 16.4
## 4 6001 California Alameda 6001400400 12.0 1.3 10.8
## 5 6001 California Alameda 6001400500 12.8 1.1 15.1
## 6 6001 California Alameda 6001400600 10.7 0.9 10.5
## EP_MINRTY EP_UNEMP EP_NOHSDP EP_AGE17 EP_DISABL EP_SNGPNT EP_MUNIT EP_MOBILE
## 1 27.3 1.0 2.7 18.6 11.8 6.2 0.0 0.0
## 2 32.8 7.9 1.1 16.3 8.9 2.7 12.9 0.0
## 3 39.0 3.4 4.5 17.0 11.4 6.3 24.2 0.4
## 4 35.5 2.5 4.3 21.1 8.3 5.6 7.8 0.0
## 5 55.3 6.4 2.6 11.5 6.1 2.0 12.1 0.0
## 6 56.1 5.6 2.4 15.3 7.9 5.8 5.5 0.0
## EP_CROWD EP_NOVEH EP_GROUPQ RPL_THEMES total_cases total_deaths
## 1 0.0 4.4 0.0 0.0274 87753324 2084168
## 2 0.7 10.4 4.9 0.2744 87753324 2084168
## 3 2.1 15.0 0.8 0.4551 87753324 2084168
## 4 0.8 9.5 1.1 0.2426 87753324 2084168
## 5 2.6 7.8 0.0 0.2332 87753324 2084168
## 6 0.0 7.5 0.5 0.1700 87753324 2084168
Cleaning economic data
#clean data for economic
econ <- subset(econ, select = c(GEO_ID, NAME,B19013_001E))
econ <- econ[-c(1, 2), ]
econ <- setNames(econ, c("GEO_ID", "tract", "income"))
econ$GEO_ID <- sub(".*US0*(\\d+)", "\\1", econ$GEO_ID)
svi_econ <- merge(econ, CA_SVI, by.x = "GEO_ID", by.y = "FIPS", all.x = TRUE, all.y = TRUE)
total_na_count <- sum(is.na(svi_econ))
print(total_na_count)
## [1] 1530
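The sub() call used above strips the Census GEO_ID prefix down to the bare FIPS digits so it can match CA_SVI$FIPS; a quick check on sample identifiers:

```r
geo_ids <- c("1400000US06001400100", "0400000US06")
cleaned <- sub(".*US0*(\\d+)", "\\1", geo_ids)
cleaned  # "6001400100" "6"
```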
#remove NA
svi_econ <- na.omit(svi_econ)
svi_econ <- svi_econ[svi_econ$income != "-", , drop = FALSE]
#leaves us with 9001 rows
str(svi_econ$income)
## chr [1:9001] "220921" "200192" "118695" "137067" "110052" "135682" "94700" ...
svi_econ$income <- as.numeric(as.character(svi_econ$income))
total_na_count <- sum(is.na(svi_econ))
print(total_na_count)
## [1] 45
svi_econ <- na.omit(svi_econ)
# 8956 rows remain
merged_data1 <- svi_econ %>%
left_join(summary_data, by = c("county_fips" = "county_fips"))
head(merged_data1)
## GEO_ID tract income STATE
## 1 6001400100 Census Tract 4001, Alameda County, California 220921 California
## 2 6001400200 Census Tract 4002, Alameda County, California 200192 California
## 3 6001400300 Census Tract 4003, Alameda County, California 118695 California
## 4 6001400400 Census Tract 4004, Alameda County, California 137067 California
## 5 6001400500 Census Tract 4005, Alameda County, California 110052 California
## 6 6001400600 Census Tract 4006, Alameda County, California 135682 California
## county_fips COUNTY EP_POV150 EP_LIMENG EP_AGE65 EP_MINRTY EP_UNEMP EP_NOHSDP
## 1 6001 Alameda 6.8 0.0 25.3 27.3 1.0 2.7
## 2 6001 Alameda 7.0 1.3 21.7 32.8 7.9 1.1
## 3 6001 Alameda 8.6 0.3 16.4 39.0 3.4 4.5
## 4 6001 Alameda 12.0 1.3 10.8 35.5 2.5 4.3
## 5 6001 Alameda 12.8 1.1 15.1 55.3 6.4 2.6
## 6 6001 Alameda 10.7 0.9 10.5 56.1 5.6 2.4
## EP_AGE17 EP_DISABL EP_SNGPNT EP_MUNIT EP_MOBILE EP_CROWD EP_NOVEH EP_GROUPQ
## 1 18.6 11.8 6.2 0.0 0.0 0.0 4.4 0.0
## 2 16.3 8.9 2.7 12.9 0.0 0.7 10.4 4.9
## 3 17.0 11.4 6.3 24.2 0.4 2.1 15.0 0.8
## 4 21.1 8.3 5.6 7.8 0.0 0.8 9.5 1.1
## 5 11.5 6.1 2.0 12.1 0.0 2.6 7.8 0.0
## 6 15.3 7.9 5.8 5.5 0.0 0.0 7.5 0.5
## RPL_THEMES total_cases total_deaths
## 1 0.0274 87753324 2084168
## 2 0.2744 87753324 2084168
## 3 0.4551 87753324 2084168
## 4 0.2426 87753324 2084168
## 5 0.2332 87753324 2084168
## 6 0.1700 87753324 2084168
svi_econ_covid=merged_data1
Creating a new column with income ranges
# Create a new column 'income_r' based on specified income thresholds
svi_econ_covid$income_r <- cut(svi_econ_covid$income,
breaks = c(-Inf, 50000, 100000, 200000, Inf),
labels = c("income<50000", "50000<income<100000", "100000<income<200000", "income>200000"),
include.lowest = TRUE)
# View the updated dataset
head(svi_econ_covid)
## GEO_ID tract income STATE
## 1 6001400100 Census Tract 4001, Alameda County, California 220921 California
## 2 6001400200 Census Tract 4002, Alameda County, California 200192 California
## 3 6001400300 Census Tract 4003, Alameda County, California 118695 California
## 4 6001400400 Census Tract 4004, Alameda County, California 137067 California
## 5 6001400500 Census Tract 4005, Alameda County, California 110052 California
## 6 6001400600 Census Tract 4006, Alameda County, California 135682 California
## county_fips COUNTY EP_POV150 EP_LIMENG EP_AGE65 EP_MINRTY EP_UNEMP EP_NOHSDP
## 1 6001 Alameda 6.8 0.0 25.3 27.3 1.0 2.7
## 2 6001 Alameda 7.0 1.3 21.7 32.8 7.9 1.1
## 3 6001 Alameda 8.6 0.3 16.4 39.0 3.4 4.5
## 4 6001 Alameda 12.0 1.3 10.8 35.5 2.5 4.3
## 5 6001 Alameda 12.8 1.1 15.1 55.3 6.4 2.6
## 6 6001 Alameda 10.7 0.9 10.5 56.1 5.6 2.4
## EP_AGE17 EP_DISABL EP_SNGPNT EP_MUNIT EP_MOBILE EP_CROWD EP_NOVEH EP_GROUPQ
## 1 18.6 11.8 6.2 0.0 0.0 0.0 4.4 0.0
## 2 16.3 8.9 2.7 12.9 0.0 0.7 10.4 4.9
## 3 17.0 11.4 6.3 24.2 0.4 2.1 15.0 0.8
## 4 21.1 8.3 5.6 7.8 0.0 0.8 9.5 1.1
## 5 11.5 6.1 2.0 12.1 0.0 2.6 7.8 0.0
## 6 15.3 7.9 5.8 5.5 0.0 0.0 7.5 0.5
## RPL_THEMES total_cases total_deaths income_r
## 1 0.0274 87753324 2084168 income>200000
## 2 0.2744 87753324 2084168 income>200000
## 3 0.4551 87753324 2084168 100000<income<200000
## 4 0.2426 87753324 2084168 100000<income<200000
## 5 0.2332 87753324 2084168 100000<income<200000
## 6 0.1700 87753324 2084168 100000<income<200000
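The cut() call bins each income by the break points given above; a quick boundary check (intervals are right-closed by default, so a value of exactly 50000 falls in the lowest bin):

```r
incomes <- c(30000, 50000, 75000, 150000, 250000)
bins <- cut(incomes,
            breaks = c(-Inf, 50000, 100000, 200000, Inf),
            labels = c("income<50000", "50000<income<100000",
                       "100000<income<200000", "income>200000"))
as.character(bins)
```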
Decision tree using the SVI, Economy, and COVID-19 datasets
#Decision tree using income, covid and other variables
library(rpart)
library(rpart.plot)
# Set a seed for reproducibility
set.seed(123)
# Split the data into a training set (80%) and a test set (20%)
library(caret)  # for createDataPartition()
splitIndex <- createDataPartition(svi_econ_covid$RPL_THEMES, p = 0.8, list = FALSE)
train_data <- svi_econ_covid[splitIndex, ]
test_data <- svi_econ_covid[-splitIndex, ]
#decision tree model
tree_model <- rpart(RPL_THEMES ~ total_cases + total_deaths + income_r + EP_AGE65 + EP_UNEMP+ EP_NOHSDP +EP_POV150, data = train_data)
# Make predictions on the test set
tree_predictions <- predict(tree_model, newdata = test_data)
# RMSE for the decision tree
tree_rmse <- sqrt(mean((test_data$RPL_THEMES - tree_predictions)^2))
print(paste("Decision Tree RMSE:", tree_rmse))
## [1] "Decision Tree RMSE: 0.138598526752175"
library("rattle") # For fancyRpartPlot (Trees) Answer "no" on installing from binary source
fancyRpartPlot(tree_model)
The regression tree models the relationship between the Social Vulnerability Index (SVI), represented by 'RPL_THEMES,' and predictors drawn from the COVID and Economy datasets. The tree, rooted at the EP_NOHSDP variable, provides insight into the strongest predictors of SVI.
The tree identifies EP_NOHSDP as the pivotal factor in the initial split, emphasizing the role of educational attainment in predicting SVI. Further branching on EP_POV150 and income_r captures the socio-economic dimension of vulnerability. Notably, tracts with lower EP_NOHSDP and EP_POV150 scores are predicted to have lower SVI values, indicating a correlation between educational attainment, poverty levels, and social vulnerability.
Interestingly, the COVID-19 variables total_cases and total_deaths do not feature prominently in the decision tree. Within the scope of our model, they appear not to be primary drivers of social vulnerability once educational attainment, poverty, and income are accounted for. The prominence of income_r in the tree underscores the multifaceted nature of vulnerability, with binned income ranges contributing significantly to the prediction of SVI.
The decision tree's predictive performance is favorable, with a Root Mean Squared Error (RMSE) of 0.1386, indicating that its predictions align closely with actual SVI values. Just as importantly, the tree is visually interpretable: the relationships it surfaces, centered on education, income, and poverty, offer actionable guidance for targeted interventions and policy measures aimed at mitigating social vulnerability.
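For context on the 0.1386 RMSE, it helps to compare against a naive baseline that always predicts the mean of the training target; the tree is useful only insofar as it beats this. A sketch with toy vectors standing in for the RPL_THEMES values:

```r
# Naive baseline: predict the training mean for every test observation
train_y <- c(0.2, 0.4, 0.6, 0.8)
test_y  <- c(0.3, 0.7)
baseline_pred <- rep(mean(train_y), length(test_y))
baseline_rmse <- sqrt(mean((test_y - baseline_pred)^2))
baseline_rmse  # 0.2
```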
Random Forest model to identify variable significance across the three datasets
# Significant variables for predicting total SVI (RPL_THEMES)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# Model
rf_model <- randomForest(RPL_THEMES ~ ., data = svi_econ_covid)
# feature importance
print(rf_model$importance)
## IncNodePurity
## GEO_ID 2.014482
## tract 2.521926
## income 110.422326
## STATE 0.000000
## county_fips 1.048596
## COUNTY 1.037469
## EP_POV150 153.409257
## EP_LIMENG 37.025773
## EP_AGE65 3.499091
## EP_MINRTY 18.348359
## EP_UNEMP 9.061591
## EP_NOHSDP 176.855275
## EP_AGE17 4.863257
## EP_DISABL 10.220366
## EP_SNGPNT 17.229356
## EP_MUNIT 11.194503
## EP_MOBILE 6.809919
## EP_CROWD 57.304590
## EP_NOVEH 18.076946
## EP_GROUPQ 14.023930
## total_cases 1.323994
## total_deaths 1.305790
## income_r 62.531463
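The IncNodePurity column above can be sorted to rank predictors; a sketch using a named vector with a few of the printed values:

```r
# Toy named importance vector mirroring part of the rf_model output
imp <- c(EP_NOHSDP = 176.9, EP_POV150 = 153.4, income = 110.4,
         income_r = 62.5, EP_CROWD = 57.3, total_cases = 1.3)
head(sort(imp, decreasing = TRUE), 3)  # EP_NOHSDP, EP_POV150, income lead
```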
The Random Forest model ranks the features most influential on the Social Vulnerability Index (SVI) by evaluating each feature's contribution to prediction accuracy and node purity. The most influential features are 'EP_NOHSDP,' the percentage of individuals without a high school diploma, and 'EP_POV150,' the percentage of individuals living below 150% of the poverty line. These socio-economic factors prominently contribute to SVI, underscoring the importance of educational attainment and economic well-being in vulnerability assessment.
Additionally, 'income_r,' the binned income measure, emerged as a significant predictor, highlighting the nuanced role of income levels in shaping social vulnerability. The 'EP_CROWD' feature, capturing the percentage of households with crowded living conditions, signals the impact of housing density on vulnerability, while 'EP_LIMENG' (limited English proficiency) and 'EP_UNEMP' (unemployment) underscore the importance of language and employment status in vulnerability modeling.
Our analysis extends beyond traditional demographic factors by incorporating 'total_cases' and 'total_deaths' as indicators of COVID-19 impact; however, these variables show little importance in predicting total SVI. The identified features provide actionable insights for policymakers and practitioners seeking to address and mitigate social vulnerability in diverse communities.
Conclusion: COVID-19 and SVI
Our comprehensive examination of the Social Vulnerability Index (SVI) in relation to COVID-19 data has yielded nuanced insights into the relationship between these two critical domains. The initial exploratory data analysis (EDA) revealed intriguing correlations within the dataset, suggesting complex interactions between SVI factors and COVID-19 metrics. However, the subsequent modeling phase provided a more in-depth perspective, indicating that there is no direct linear association between COVID-19 metrics (both confirmed cases and death rates) and the SVI.
This lack of a straightforward predictive relationship between COVID-19 impact and SVI highlights the complexity of social vulnerability. It suggests that while COVID-19 has been a significant public health challenge, the factors contributing to social vulnerability are diverse and cannot be solely predicted based on the incidence or severity of the pandemic.
This revelation is instrumental for policymakers and public health officials, as it guides the focus towards a more holistic understanding of vulnerability, encompassing a wider range of social, economic, and health-related factors.